A Daily Californian analysis of crime in the UC Berkeley area.
This repository contains tools for parsing and visualizing UCPD daily report logs from 2010 to 2015. Much of the code and methodology can be adapted to fit other data sources.
Clone the repo and install the requirements.
pip install -r requirements.txt
npm install
Set the following environment variables:
DB_NAME: name of a PostGIS database
DB_USER: username with access to said database
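In a POSIX shell, these might be set like so (the values here are placeholders, not the project's actual names):

```shell
export DB_NAME=ucpd_crime   # name of a PostGIS database (placeholder value)
export DB_USER=dailycal     # user with access to that database (placeholder value)
```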
If you'd like to deploy to S3 using django-bakery, set these as well:
To get started from scratch, run
python manage.py load, which will call:
load_bins, to import hexagonal bins from a shapefile in
load_ucpd, to load historical UCPD crime data
classify, to collapse incident information into one of three categories: violent, property or quality-of-life
locate, to merge location information with the address database to assign each incident a latitude and longitude
assign_bin, to locate each incident within a bin
compute_stats, to compute some basic statistics about crime across bins, across categories and over time
pack, to serialize incident-level information using Tamper
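The real stages are separate Django management commands; as a rough, self-contained illustration of the pipeline's shape (the function bodies, codes and coordinates below are toy stand-ins, not the project's actual implementation):

```python
# Toy sketch of three of the load stages: classify, assign_bin, compute_stats.

def classify(incident, categories):
    # Collapse a raw incident code into V, P, Q or N; unknown codes fall to N.
    incident["category"] = categories.get(incident["code"], "N")
    return incident

def assign_bin(incident, bins):
    # Nearest-centroid stand-in for locating a point inside a hexagonal bin.
    lat, lng = incident["lat"], incident["lng"]
    incident["bin"] = min(
        bins, key=lambda b: (bins[b][0] - lat) ** 2 + (bins[b][1] - lng) ** 2
    )
    return incident

def compute_stats(incidents):
    # Count incidents per (bin, category) pair.
    stats = {}
    for inc in incidents:
        key = (inc["bin"], inc["category"])
        stats[key] = stats.get(key, 0) + 1
    return stats

categories = {"BURGLARY": "P", "ROBBERY": "V"}  # hypothetical raw codes
bins = {1: (37.8719, -122.2585), 2: (37.8650, -122.2530)}  # hypothetical centroids
incidents = [
    {"code": "BURGLARY", "lat": 37.8720, "lng": -122.2580},
    {"code": "ROBBERY", "lat": 37.8652, "lng": -122.2532},
]
incidents = [assign_bin(classify(i, categories), bins) for i in incidents]
print(compute_stats(incidents))  # → {(1, 'P'): 1, (2, 'V'): 1}
```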
Incident-level reports come from a Public Records Act (PRA) request filed with the UC Police Department. They cover January 2010 to September 2015. These raw data files are stored in
Hexagonal bins were generated in QGIS. The shapefile is stored in
Simple spreadsheet that maps the codes in the raw data to category codes:
V for violent crimes,
P for property crimes and
Q for quality-of-life crimes.
N is reserved for crimes that we aren't interested in analyzing or displaying.
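A two-column mapping like this could be loaded with the standard library's csv module; the column names and codes below are assumptions for illustration, not the project's actual spreadsheet:

```python
import csv
import io

# Hypothetical spreadsheet contents: raw UCPD code -> category letter.
SPREADSHEET = """code,category
BURGLARY,P
ROBBERY,V
NOISE COMPLAINT,Q
FOUND PROPERTY,N
"""

def load_categories(fileobj):
    # Build a lookup dict from the two-column CSV.
    return {row["code"]: row["category"] for row in csv.DictReader(fileobj)}

categories = load_categories(io.StringIO(SPREADSHEET))
print(categories["ROBBERY"])  # → V
```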
Tamper is a New York Times library for efficient serialization of data. We use Tamper instead of raw JSON so we can experiment with sending all incidents to the user's browser, then using Pourover to quickly sort and filter that data on the client side.
This means we can't send coordinates for each individual incident. Instead, we assign incidents to a bin and then send only the incident's bin ID. With small enough bins, this gives a fairly detailed look at the spatial distribution of crime, and keeps the data file being sent remarkably light (41KB, in this case).
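Tamper's bit-packing goes further than this, but even a plain JSON comparison shows the saving from replacing per-incident coordinates with small integer bin IDs (the incident shapes below are hypothetical):

```python
import json

# Hypothetical payloads: full coordinates vs. bin IDs only.
with_coords = [{"lat": 37.871234, "lng": -122.258456, "category": "P"}] * 1000
with_bins = [{"bin": 42, "category": "P"}] * 1000

full_size = len(json.dumps(with_coords))
binned_size = len(json.dumps(with_bins))
print(binned_size < full_size)  # → True: bin IDs are far lighter than coordinates
```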
While it's more of an experiment than something of great use for data of this scale (~10 thousand incidents), it's an interesting model for scaling up to hundreds of thousands of incidents — something we've tried with historical data from the city police department.
Building and deploying
Build this site out as flat files by running
python manage.py build.
If you've set the appropriate environment variables, publish to S3 using
python manage.py publish.
Where is this going?
We want to try scaling up this binning methodology to bigger datasets. That would involve creating a new shapefile and coming up with new address and classification dictionaries, but the rest of the loading, binning and serialization code should work.
We tried a few Pourover filters beyond our basic classification (violent, property, quality-of-life), but none of them proved interesting for this particular dataset. For categorical variables, though, this approach enables very fast visualizations of geospatial data, without running a server.