More about the datasets in this package

Angela Li edited this page Jun 22, 2020 · 4 revisions

The geodaData package provides the datasets from the GeoDa Data and Lab site in an R-friendly format. Note, however, that the datasets themselves may not be consistent with one another, and this page attempts to explain why.

The original datasets were compiled over a period of 20 years and have never been standardized into a consistent format: names are inconsistent, variables are inconsistent, metadata is not provided for all datasets, and the origins of some datasets have been lost. According to Julia Koschinsky, there were three main stages in putting together the GeoDa data site:

  1. Historical, small-scale datasets. These are spatial datasets with relatively few observations that were used in older published papers, such as the Guerry dataset, the Columbus dataset, and so on (from Julia: bostonhsg, baltimore, buenos aires, columbus, grid100, laozone, lasrosas, NCOVR/natregimes, oz9799, police, scotlip, SIDS/SIDS2). I believe this data comes from the time Julia and Luc were at UIUC. They are mainly stored for archival purposes.
  2. Medium-sized Census datasets. These were put together at Arizona State University and include all of the 2000 Census tract datasets for various cities (Sacramento1, Sacramento2, etc.).
  3. Modern, larger-scale datasets. After the Center for Spatial Data Science was established at the University of Chicago in 2017, Luc put together a number of datasets for his new classes (e.g. Introduction to Spatial Data Science) and started collecting datasets that students had cleaned for their class projects. These are slightly larger, often come from open data portals or companies, and include the Airbnb data, the Chicago health data, abandoned cars, and more.