# Prepare example data

To illustrate the application of the tools contained here, an example dataset is provided. This notebook walks through the pre-processing steps used to derive a dataset in a form suitable for analysis.

The data are drawn from the [Crime Open Database (CODE)](https://osf.io/zyaqn/), maintained by Matt Ashby, which collates crime data from a number of open sources in a harmonised format. The 2016 snapshot of this data was downloaded in CSV format.

The spatial data is provided as latitude/longitude coordinates; here the PyProj library is used to re-project them to a coordinate system measured in metres, so that distances can be calculated directly from the coordinates.
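As a rough illustration of why raw latitude/longitude values are awkward for distance calculations, the sketch below compares the ground distance covered by 0.01 degrees of latitude and 0.01 degrees of longitude. It assumes a spherical Earth and an approximate point near central Chicago; the function name `haversine_m` and the coordinates are illustrative and are not part of the dataset or the tools.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation (assumption)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Illustrative point near central Chicago (approximate, not taken from the dataset)
lon, lat = -87.63, 41.88

north_south = haversine_m(lon, lat, lon, lat + 0.01)  # step of 0.01 degrees north
east_west = haversine_m(lon, lat, lon + 0.01, lat)    # step of 0.01 degrees east
```

At this latitude the east-west step is noticeably shorter than the north-south step, so a Euclidean distance computed on degrees would be distorted; a projected system in metres avoids this.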

In [1]:
import pandas as pd
from pyproj import Proj, transform

The test data consists of incidents from the city of **Chicago** in the offence category '**residential burglary/breaking & entering**'.

In [2]:
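# Load the 2016 CODE snapshot, parsing the incident date column as datetimes
data = pd.read_csv("../data/crime_open_database_core_2016.csv", parse_dates=['date_single'])
# Keep only Chicago incidents of the chosen offence type
data = data[data['city_name'] == "Chicago"]
data = data[data['offense_type'] == "residential burglary/breaking & entering"]
# (rows, columns) of the filtered data
data.shape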

(11701, 14)
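The two filtering steps can also be combined into a single boolean mask. The following is a minimal sketch on a small made-up frame (the rows, including the second offence label, are invented for illustration and are not real CODE records); only the column names `city_name`, `offense_type` and `date_single` come from the cell above.

```python
import pandas as pd

# Made-up rows that mimic the columns used for filtering; not real CODE records
toy = pd.DataFrame({
    "city_name": ["Chicago", "Chicago", "Detroit"],
    "offense_type": [
        "residential burglary/breaking & entering",
        "theft of motor vehicle",
        "residential burglary/breaking & entering",
    ],
    "date_single": pd.to_datetime(["2016-01-05", "2016-01-06", "2016-01-07"]),
})

# Build one boolean mask from both conditions and apply it once
mask = (toy["city_name"] == "Chicago") & (
    toy["offense_type"] == "residential burglary/breaking & entering"
)

# .copy() gives an independent frame, so adding columns later does not touch `toy`
subset = toy[mask].copy()
```

Only the first row satisfies both conditions, so `subset` holds a single Chicago burglary record.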

The re-projection will use the [Illinois State Plane](http://www.spatialreference.org/ref/epsg/26971/) (NAD83 / Illinois East, EPSG:26971), whose coordinates are in metres, as the target reference system.

In [3]:
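# Source CRS: WGS84 longitude/latitude; target CRS: Illinois State Plane (EPSG:26971)
wgs84 = Proj(init='epsg:4326')
isp = Proj(init='epsg:26971')
# With init-style definitions, coordinates are passed in (longitude, latitude) order
x, y = transform(wgs84, isp, data["longitude"].values, data["latitude"].values)
# Attach the projected coordinates as new columns
data = data.assign(x=x, y=y)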
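Releases of pyproj from 2.x onwards deprecate both the `Proj(init=...)` syntax and the module-level `transform` function in favour of the `Transformer` class. The sketch below shows what an equivalent re-projection might look like with that interface; it is not the code used to produce the example dataset, and the test point is an approximate location in central Chicago chosen for illustration.

```python
from pyproj import Transformer

# always_xy=True keeps the (longitude, latitude) -> (x, y) argument order
to_isp = Transformer.from_crs("EPSG:4326", "EPSG:26971", always_xy=True)

# Illustrative point near central Chicago (approximate, not taken from the dataset)
lon, lat = -87.63, 41.88

# Returns easting and northing in metres in the Illinois East state plane
x, y = to_isp.transform(lon, lat)
```

The same `transform` call also accepts arrays, so `data["longitude"].values` and `data["latitude"].values` could be passed in place of the scalar values.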

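Finally, save the derived data in minimal form, keeping only the projected coordinates and the date. Dates are written in day/month/year format, so any time-of-day component is discarded.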

In [4]:
data.to_csv("../data/test_data.csv", columns=['x','y','date_single'], date_format='%d/%m/%Y', index=False)
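Because the dates are written as day/month/year, some care is needed when the file is read back. The sketch below uses a couple of made-up rows in the same three-column layout (the values are invented and are not taken from `test_data.csv`) and parses the dates with an explicit format rather than relying on automatic inference.

```python
import io

import pandas as pd

# Made-up rows in the same layout as the saved file (x, y in metres; day/month/year dates)
csv_text = (
    "x,y,date_single\n"
    "358000.0,579000.0,01/02/2016\n"
    "359500.5,581250.25,13/02/2016\n"
)

loaded = pd.read_csv(io.StringIO(csv_text))

# An explicit format ensures 01/02/2016 is read as 1 February, not 2 January
loaded["date_single"] = pd.to_datetime(loaded["date_single"], format="%d/%m/%Y")
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path `"../data/test_data.csv"`.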