Chicago Taxi Data

Used in support of this post:

Code to download, process, and analyze Chicago's publicly available taxi data.

Something of a companion to the nyc-taxi-data repo. The repos share some similar code and structure, but do not explicitly depend on each other.


1. Install PostgreSQL and PostGIS

Both are available via Homebrew on Mac OS X

2. Download and import Chicago taxi data

Note: the raw data is a single uncompressed ~40 GB .csv file, so it will take a while to download!


New data is available monthly. Once you've run the full setup, you can download and process only the latest data by running


This avoids downloading the entire ~40 GB dataset every time you want to add a new month of data.
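
The idea behind an incremental update can be illustrated independently of the repo's scripts: keep only the trips that are newer than the latest one already loaded. The Python below is a minimal sketch of that filter, not the repo's implementation. The function name `rows_after`, the `trip_id` and `trip_start` field names, and the ISO timestamp format are all assumptions made for illustration.

```python
import csv
import io
from datetime import datetime


def rows_after(csv_text, cutoff, field="trip_start"):
    """Yield rows whose timestamp in `field` is strictly later than `cutoff`.

    `field` and the ISO 8601 timestamp format are hypothetical; adapt them
    to the columns of the file you are actually loading.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if datetime.fromisoformat(row[field]) > cutoff:
            yield row


# Tiny made-up sample: one trip already loaded, two newer ones.
sample = (
    "trip_id,trip_start\n"
    "a,2016-11-30T23:45:00\n"
    "b,2016-12-01T00:00:00\n"
    "c,2016-12-01T00:15:00\n"
)

# Timestamp of the most recent trip already in the database (assumed known).
cutoff = datetime(2016, 11, 30, 23, 45)

new_rows = list(rows_after(sample, cutoff))
print([r["trip_id"] for r in new_rows])
```

In a real pipeline the cutoff would come from a query against the trips table, and the reader would stream from the downloaded file rather than from an in-memory string.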

3. Analysis

The prepare_analysis.sql and analysis.R scripts run the analysis in Postgres and R

Some differences between Chicago and NYC taxi data

  • Chicago includes an anonymous medallion ID, New York does not
  • Chicago does not include precise location coordinates, only census tracts and community areas (and even then, only sometimes)
  • Chicago does not include precise timestamps, and instead rounds pickups and drop-offs to 15-minute intervals
  • Chicago does not include any data from ridesharing companies like Uber and Lyft
  • Chicago contains just over 100 million rows, making it significantly smaller than NYC's 1.3 billion rows
  • Chicago requires significantly less preprocessing and has fewer unexplained data abnormalities than the NYC data

The last two points in particular suggest that the Chicago dataset is easier to work with than the NYC dataset
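
To make the 15-minute rounding of timestamps concrete, here is a small Python sketch of what such rounding does to a pickup time. It assumes rounding to the nearest interval (halves rounded up); the dataset description above does not say whether values are rounded to the nearest interval or truncated, so treat that choice, and the function name `round_to_interval`, as illustrative assumptions.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=15)


def round_to_interval(ts, interval=INTERVAL):
    """Round `ts` to the nearest multiple of `interval` counted from midnight.

    Assumption: nearest-interval rounding with halves rounded up. The real
    dataset may truncate instead.
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    # Adding half an interval before floor division rounds to the nearest step.
    steps = (elapsed + interval / 2) // interval
    return midnight + steps * interval


# Two distinct pickups five minutes apart collapse to the same rounded value.
first = round_to_interval(datetime(2016, 1, 1, 8, 1))
second = round_to_interval(datetime(2016, 1, 1, 8, 6))
print(first, second, first == second)
```

One consequence is that trip durations computed from the rounded pickup and drop-off times can only be multiples of 15 minutes, so short trips may appear to have zero duration.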

Additional data sources included

