In [1]:
# Load scripts without installing
import sys, os.path
sys.path.insert(0, os.path.abspath(".."))

# Usage of Knox test

This notebook accompanies code which performs the Knox test for space-time clustering. This test is concerned with establishing the existence of space-time association within a set of events taking place in space and time - that is, the tendency of events which are close in space to also be close in time - and quantifying the level of association. The test is in common use within the study of crime, in which context it is typically used to demonstrate that crime risk is communicable across space; that is, that the vicinity of a recently victimised location is at disproportionate risk of further victimisation for some period after the first event. This is often manifested in the form of 'near repeat' victimisation.

The test was first introduced by Knox (1964) in the context of epidemiology, and has subsequently become widely used within spatial statistics. Several variants of the test, differing in their technical details, have been proposed. The version implemented here is essentially that upon which the Near Repeat Calculator is based, which is described in detail by Johnson *et al.* (2007).

This notebook demonstrates the use of the test as implemented here. The test is applied to a set of open crime data, and its output is examined. As will be explained, various aspects of the test can be configured according to need.
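To make the idea of space-time association concrete, the following is a minimal sketch of the quantity at the heart of the test: the number of event pairs that are close in both space and time. The function `knox_count` and the toy data are hypothetical and written with the standard library only; this is not the implementation used by the accompanying code.

```python
from itertools import combinations
from math import hypot


def knox_count(xy, t, s_max, t_max):
    """Count event pairs at most s_max apart in space and t_max apart in time."""
    count = 0
    for i, j in combinations(range(len(t)), 2):
        dist = hypot(xy[i][0] - xy[j][0], xy[i][1] - xy[j][1])
        gap = abs(t[i] - t[j])
        if dist <= s_max and gap <= t_max:
            count += 1
    return count


# Toy data: two tight space-time clusters (coordinates in metres, times in days)
xy = [(0, 0), (50, 0), (0, 80), (5000, 5000), (5030, 5040)]
t = [0, 2, 5, 100, 103]

close_pairs = knox_count(xy, t, s_max=200, t_max=7)
print(close_pairs)
```

The test then asks whether a count of this kind is larger than would be expected if the timing of events were unrelated to their location.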

In [2]:
import pandas as pd
import numpy as np
import stats.knox as kx

## Example data

This example will be based on data from the [Crime Open Database (CODE)](https://osf.io/zyaqn/), which is a repository of open-source crime data from a number of sources, maintained by Matt Ashby. In this example, data from Chicago for residential burglary (breaking and entering) in 2016 will be used.

In this notebook, we will use a pre-processed version of this data, saved as `chicago_burglary_2014_2017.csv`. The pre-processing, details of which can be found in the [accompanying notebook](Prepare%20example%20data.ipynb), simply re-projects the data into metric coordinates.

In the dataset, `x` and `y` refer to spatial coordinates, while `date_single` refers to the date of the event. The data takes the following form:

In [3]:
data = pd.read_csv("../data/chicago_burglary_2014_2017.csv", 
                   parse_dates=['date_single'], 
                   dayfirst=True)

data = data[data['date_single'].between('2016-01-01', '2016-12-31')]
data.head()

Unnamed: 0,x,y,date_single
23192,350353.394957,578924.252858,2016-01-01
23193,355691.341195,586715.567123,2016-01-01
23194,353206.944625,584361.596673,2016-01-01
23195,352692.117119,567851.425945,2016-01-01
23196,357290.640106,571106.074148,2016-01-01


## The Knox test

The test itself is implemented via the function `knox_test()`. The docstring provides full details for all arguments, but a basic call to the function requires five to be specified:

- `xy`: Vector of x-y coordinates, of shape (N, 2)
- `t`: Vector of event times of shape (N,), measured in any units, relative to an origin timestamp
- `s_bands`: Sequence of upper limits for the spatial intervals of interest
- `t_bands`: Sequence of upper limits for the temporal intervals of interest, in the same units as `t`
- `n_iter`: Number of permutations to be performed in the course of the test

The spatio-temporal event data should be provided as two separate objects, `xy` and `t`, each of which can be either a NumPy array or a Pandas object. They must contain the same number of rows and be aligned, so that row *i* of `xy` and element *i* of `t` describe the same event.

For the temporal data, it is necessary to convert timestamps to numerical units, essentially measuring the time since an (arbitrary) origin point. In the example below, `t` is derived as the number of days since '2016-01-01'.

In [4]:
xy = data[['x','y']]
t = (data['date_single'] - pd.to_datetime('2016-01-01')).dt.days
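Since `t` can be measured in any units, a finer resolution than days can be used where timestamps include a time of day. The sketch below uses only the standard library and invented timestamps to derive times in hours since the origin; the variable names are illustrative assumptions.

```python
from datetime import datetime

# Arbitrary origin and some invented event timestamps
origin = datetime(2016, 1, 1)
timestamps = [
    datetime(2016, 1, 1, 6, 0),
    datetime(2016, 1, 2, 18, 30),
    datetime(2016, 1, 8, 0, 0),
]

# Elapsed time since the origin, expressed in hours
t_hours = [(ts - origin).total_seconds() / 3600 for ts in timestamps]
print(t_hours)

# Temporal bands must then be given in the same units: 7, 14 and 21 days in hours
t_bands_hours = [24 * 7, 24 * 14, 24 * 21]
```

Whatever unit is chosen, the temporal band limits must be expressed in that same unit.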

The `s_bands` and `t_bands` arguments both specify the upper limits for each of the specified bands. The default behaviour is that the bands are closed on the right side, and open on the left side - though this can be changed via the `interval_side` argument. The exception to this is the first interval in each dimension, which always has a closed limit of 0 on the left side.
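The interval convention described above can be sketched with the standard library's `bisect` module. The helper `band_index` and its `right_closed` flag are hypothetical names used purely for illustration; they are not part of `knox_test()`, whose internal implementation may differ.

```python
from bisect import bisect_left, bisect_right


def band_index(value, bands, right_closed=True):
    """Return the index of the band containing value, or None if it lies
    beyond the last upper limit.

    right_closed=True  -> bands such as (200, 400]; a value of 200 is in the first band
    right_closed=False -> bands such as [200, 400); a value of 200 is in the second band
    In both cases the first band includes 0.
    """
    if value < 0:
        raise ValueError("distances and time gaps must be non-negative")
    if right_closed:
        idx = bisect_left(bands, value)
    else:
        idx = bisect_right(bands, value)
    return idx if idx < len(bands) else None


bands = [200, 400, 600]
print(band_index(200, bands))                      # upper limit, right-closed
print(band_index(200, bands, right_closed=False))  # upper limit, right-open
print(band_index(650, bands))                      # beyond the last band
```

The only values affected by the choice of convention are those falling exactly on a band limit, which is common for temporal data recorded in whole days.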

If **three bands** of **200 metres** each are required, for example, this can be achieved as follows:

In [5]:
s_bands = [200, 400, 600]

In the study of crime, it is common to treat **'exact repeats'** (i.e. incidents occurring at exactly the same location) as a separate case. This is typically operationalised by using a small tolerance value which constitutes a margin of error, such as 0.1 metres. To achieve this, the above could be adapted to:

In [6]:
s_bands = [0.1, 200, 400, 600]

The setting of temporal limits is exactly analogous; to set up **three bands** of **7 days** each, the following would be used:

In [7]:
t_bands = [7, 14, 21]

In general, any set of bands can be used, as long as they are sorted in ascending order.

The Knox test can then be performed by applying it to these values, along with the required number of iterations. Here, for reproducibility, we also set the random `seed` value, though this would be omitted in a true run of the test.

In [8]:
result = kx.knox_test(xy, t, s_bands, t_bands, n_iter=99, seed=123456)

The output of the test, stored here as `result`, is an instance of the `KnoxResult` class, which can be inspected to examine various aspects of the outcome of the test.
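To illustrate what the permutations involve, the sketch below assumes the common approach of shuffling the event times across the fixed locations and recounting close pairs each time. The functions `count_close_pairs` and `permuted_counts` are hypothetical, and whether `knox_test()` proceeds in exactly this way is an assumption rather than something established in this notebook.

```python
import random
from itertools import combinations
from math import hypot


def count_close_pairs(xy, t, s_max, t_max):
    """Count pairs within s_max in space and t_max in time."""
    return sum(
        1
        for i, j in combinations(range(len(t)), 2)
        if hypot(xy[i][0] - xy[j][0], xy[i][1] - xy[j][1]) <= s_max
        and abs(t[i] - t[j]) <= t_max
    )


def permuted_counts(xy, t, s_max, t_max, n_iter, seed=None):
    """Recompute the pair count n_iter times with event times shuffled."""
    rng = random.Random(seed)
    shuffled = list(t)  # work on a copy so the caller's times are not modified
    counts = []
    for _ in range(n_iter):
        rng.shuffle(shuffled)
        counts.append(count_close_pairs(xy, shuffled, s_max, t_max))
    return counts


# Toy data: two tight space-time clusters (metres, days)
xy = [(0, 0), (50, 0), (0, 80), (5000, 5000), (5030, 5040)]
t = [0, 2, 5, 100, 103]

observed = count_close_pairs(xy, t, s_max=200, t_max=7)
simulated = permuted_counts(xy, t, s_max=200, t_max=7, n_iter=19, seed=1)
print(observed, simulated)
```

Shuffling the times breaks any link between where and when events occurred while preserving the spatial and temporal patterns individually, so the simulated counts indicate what would be expected in the absence of space-time association.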

## Knox results

The `KnoxResult` class contains a number of forms of output, stored as Pandas DataFrames. These are the following:

- `conting_observed`: The contingency table containing counts of event pairs for the observed data 
- `p_values`: Significance values calculated for each cell of the table
- `ratios_median`: Ratios of observed to simulated counts, calculated relative to the median value across all permutations
- `ratios_mean`: Ratios of observed to simulated counts, calculated relative to the mean value across all permutations
- `z_scores`: Z-scores of observed counts relative to simulated counts

Each of these can be accessed as an attribute of the result object. In all of these outputs, [interval notation](https://en.wikipedia.org/wiki/Interval_(mathematics)) is used to label the bands.
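As a rough guide to how such quantities relate to the permutation counts, the sketch below computes them for a single cell from an observed count and a list of simulated counts. The function `summarise_cell` is hypothetical, and the specific conventions used (a one-sided Monte Carlo p-value that includes the observed value, and the population standard deviation for the z-score) are assumptions that may differ from those in `KnoxResult`.

```python
from statistics import mean, median, pstdev


def summarise_cell(observed, simulated):
    """Summarise one contingency-table cell against its simulated counts."""
    n_iter = len(simulated)
    # Number of permutations with a count at least as large as the observed one
    at_least = sum(1 for s in simulated if s >= observed)
    sim_mean = mean(simulated)
    return {
        "p_value": (1 + at_least) / (1 + n_iter),
        "ratio_median": observed / median(simulated),
        "ratio_mean": observed / sim_mean,
        "z_score": (observed - sim_mean) / pstdev(simulated),
    }


# Invented counts for illustration only
summary = summarise_cell(10, [4, 5, 6, 5, 4, 6, 5, 5, 5])
print(summary)
```

Under this convention, the smallest p-value attainable with 99 permutations is 1/100 = 0.01. The sketch does not guard against a simulated median, mean or standard deviation of zero, which would need handling in practice.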

In [9]:
result.conting_observed

Unnamed: 0,"[0, 7]","(7, 14]","(14, 21]"
"[0, 0.1]",342.0,97.0,62.0
"(0.1, 200.0]",1892.0,1514.0,1362.0
"(200.0, 400.0]",4145.0,3643.0,3592.0
"(400.0, 600.0]",5805.0,5143.0,4919.0


In [10]:
result.p_values

Unnamed: 0,"[0, 7]","(7, 14]","(14, 21]"
"[0, 0.1]",0.01,0.01,0.05
"(0.1, 200.0]",0.01,0.01,0.01
"(200.0, 400.0]",0.01,0.01,0.01
"(400.0, 600.0]",0.01,0.01,0.01


In [11]:
result.ratios_median

Unnamed: 0,"[0, 7]","(7, 14]","(14, 21]"
"[0, 0.1]",6.45283,1.979592,1.319149
"(0.1, 200.0]",1.417228,1.248145,1.140704
"(200.0, 400.0]",1.136239,1.10595,1.113798
"(400.0, 600.0]",1.130037,1.100342,1.077311


Of course, the underlying raw values can be accessed as NumPy arrays from the DataFrames.

In [12]:
result.ratios_median.values

array([[6.45283019, 1.97959184, 1.31914894],
       [1.41722846, 1.24814509, 1.14070352],
       [1.13623904, 1.10595021, 1.11379845],
       [1.13003699, 1.10034232, 1.07731056]])

These outputs can also be rendered in HTML form, coloured to reflect the results. Each of the result tables has an accompanying `plot_` function, which allows cells to be coloured in a customisable way.

The function `plot_ratios_median()`, for example, allows the ratios to be rendered using a specified colormap. It takes three arguments:

- `vmin`: The lower saturation extent of the colormap
- `vmax`: The upper saturation extent of the colormap
- `cmap`: A Matplotlib [colormap](https://matplotlib.org/gallery/color/colormap_reference.html)

In [13]:
result.plot_ratios_median(1, 3, 'autumn_r')

Unnamed: 0,"[0, 7]","(7, 14]","(14, 21]"
"[0, 0.1]",6.45283,1.979592,1.319149
"(0.1, 200.0]",1.417228,1.248145,1.140704
"(200.0, 400.0]",1.136239,1.10595,1.113798
"(400.0, 600.0]",1.130037,1.100342,1.077311


## Further options

A number of further options can also be specified when performing the Knox test.

### Interval sides

The default behaviour for the `knox_test()` function is for all band intervals (except the first) to be closed on the right side and open on the left. This behaviour is controlled by the `interval_side` parameter, which defaults to 'left'; the opposite behaviour (intervals closed on the left and open on the right) can be achieved by setting it to 'right'.

In [14]:
result = kx.knox_test(xy, t, s_bands, t_bands, n_iter=99, interval_side='right', seed=123456)
result.conting_observed

Unnamed: 0,"[0, 7)","[7, 14)","[14, 21)"
"[0, 0.1)",319.0,106.0,62.0
"[0.1, 200.0)",1647.0,1539.0,1400.0
"[200.0, 400.0)",3598.0,3660.0,3638.0
"[400.0, 600.0)",5028.0,5167.0,4935.0


As might be expected, the counts for the \[0, 7\) interval are lower than those for \[0, 7\] in the previous case, since pairs with a temporal separation of exactly 7 days are no longer included.

### Alternative metrics

The default behaviour for the test is to compute spatial distance on a Euclidean basis (i.e. 'as the crow flies' distance in 2-dimensional space). If required, this can also be computed using ['Manhattan' distance](https://en.wikipedia.org/wiki/Taxicab_geometry) by setting the `metric` parameter. Currently 'euclidean' and 'manhattan' are the only metrics supported.

In [15]:
result = kx.knox_test(xy, t, s_bands, t_bands, n_iter=99, metric='manhattan', seed=123456)
result.conting_observed

Unnamed: 0,"[0, 7]","(7, 14]","(14, 21]"
"[0, 0.1]",342.0,97.0,62.0
"(0.1, 200.0]",1502.0,1163.0,1038.0
"(200.0, 400.0]",2806.0,2431.0,2386.0
"(400.0, 600.0]",3968.0,3553.0,3395.0


With the exception of the exact-repeat band, where the two metrics coincide, the counts within this table are lower than those computed above using Euclidean distance. Again, this is not surprising - since Manhattan distance is always greater than or equal to Euclidean distance, fewer pairs fall within any given distance limit.
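The relationship between the two metrics can be checked with a small worked example. The functions below are illustrative stand-ins written with the standard library, not the distance routines used by `knox_test()`.

```python
from math import hypot


def euclidean(p, q):
    """Straight-line distance between two points."""
    return hypot(p[0] - q[0], p[1] - q[1])


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


a, b = (0.0, 0.0), (300.0, 400.0)
d_euc = euclidean(a, b)
d_man = manhattan(a, b)
print(d_euc, d_man)  # 500.0 700.0
```

With spatial bands of `[200, 400, 600]`, this pair would be counted in the final band under Euclidean distance, but would fall beyond the last band under Manhattan distance.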

## References

- Knox, G. (1964). Epidemiology of Childhood Leukaemia in Northumberland and Durham. *British Journal of Preventive and Social Medicine*, 18(1), 17–24.
- Johnson, S. D., Bernasco, W., Bowers, K. J., Elffers, H., Ratcliffe, J., Rengert, G., & Townsley, M. (2007). Space–Time Patterns of Risk: A Cross National Assessment of Residential Burglary Victimization. *Journal of Quantitative Criminology*, 23(3), 201–219. https://doi.org/10.1007/s10940-007-9025-3