This readme explains how to make best use of the data from the Stanford Open Policing Project. We provide an overview of the data and a list of best practices for working with the data.

Our analysis code and further documentation are available at https://github.com/5harad/openpolicing.

Overview of the data file structure

For each dataset, we provide 4 files:

  1. A zipped csv file of the cleaned data
  2. An RDS of the cleaned data
  3. Tarballed (zipped) shapefiles
  4. Tarballed (zipped) raw data (available upon request)

Description of standardized data

Each row in the cleaned data represents a stop. The following details the maximal set of features we attempted to extract from each location. Coverage varies by location. Fields with an asterisk were removed for public release due to privacy concerns. All columns except raw_row_number, violation, disposition, location, officer_assignment, any city or state subgeography (i.e. county, beat, division, etc), unit, and vehicle_{color,make,model,type} are also digit sanitized (each digit replaced with "-") for privacy concerns.
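
To make the digit sanitization concrete, here is a minimal Python sketch that replaces each digit in a value with "-". The function name `sanitize_digits` is hypothetical and is not taken from our processing code; it only illustrates the rule stated above.

```python
import re


def sanitize_digits(value):
    """Replace every ASCII digit with "-"; pass None (NA) through unchanged."""
    if value is None:
        return None
    return re.sub(r"[0-9]", "-", str(value))


# Hypothetical free-text value containing digits.
print(sanitize_digits("APT 12B"))  # APT --B
```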

Note that many locations have additional information that could be extracted (e.g., zip code), but we do not designate a standardized column for information beyond what is listed below, either because we do not use the information in our analysis and/or because not enough locations provided this information. We do pull through some additional columns (discussed on a location-by-location basis within this readme), which have column names prefaced by "raw_".

Column name Column meaning Example value
raw_row_number A number used to join clean data back to the raw data 38299
date The date of the stop, in YYYY-MM-DD format. Some states do not provide the exact stop date: for example, they only provide the year or quarter in which the stop occurred. For these states, date is set to the date at the beginning of the period: for example, January 1 if only year is provided. "2017-02-02"
time The 24-hour time of the stop, in HH:MM format. 20:15
location The freeform text of the location. Occasionally, this represents the concatenation of several raw fields, i.e. street_number, street_name "248 Stockton Rd."
lat The latitude of the stop. If not provided by the department, we attempt to geocode any provided address or location using Google Maps. Google Maps returns a "best effort" response, which may not be completely accurate if the provided location was malformed or underspecified. To protect against spurious responses, geocodes more than 4 standard deviations from the median stop lat/lng are set to NA. 72.23545
lng The longitude of the stop. If not provided by the department, we attempt to geocode any provided address or location using Google Maps. Google Maps returns a "best effort" response, which may not be completely accurate if the provided location was malformed or underspecified. To protect against spurious responses, geocodes more than 4 standard deviations from the median stop lat/lng are set to NA. 115.2808
county_name County name where provided "Allegheny County"
neighborhood This is the neighborhood of the stop and some police departments will provide this instead of a location or beat. "GRNBELT"
beat Police beat. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the beat using the shapefiles. 8
district Police district. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the district using the shapefiles. 8
subdistrict Police subdistrict. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the subdistrict using the shapefiles. 8
division Police division. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the division using the shapefiles. 8
subdivision Police subdivision. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the subdivision using the shapefiles. 8
police_grid_number Police grid number. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the police grid number using the shapefiles. 8
precinct Police precinct. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the precinct using the shapefiles. 8
region Police region. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the region using the shapefiles. 8
reporting_area Police reporting area. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the reporting area using the shapefiles. 8
sector Police sector. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the sector using the shapefiles. 8
subsector Police subsector. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the subsector using the shapefiles. 8
substation Police substation. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the substation using the shapefiles. 8
service_area Police service area. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the service area using the shapefiles. 8
zone Police zone. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the zone using the shapefiles. 8
subject_age The age of the stopped subject. When date of birth is given, we calculate the age based on the stop date. Values outside the range of 10-110 are coerced to NA. 54.23
subject_dob* The date of birth of the stopped subject. "1956-02-23"
subject_yob* The year of birth of the subject. 1983
subject_race The race of the stopped subject. Values are standardized to white, black, hispanic, asian/pacific islander, and other/unknown "hispanic"
subject_sex The recorded sex of the stopped subject. "female"
officer_id* Officer badge number or other form of identification provided by the department. 8
officer_id_hash A unique hash of the officer id used to identify individual officers within a location. "a888fdc120"
officer_age The age of the officer. When date of birth is given, we calculate the age based on the stop date. Values outside the range of 10-100 are coerced to NA. 54.23
officer_dob* The date of birth of the officer. "1956-02-23"
officer_race The race of the officer. Values are standardized to white, black, hispanic, asian/pacific islander, and other/unknown "hispanic"
officer_sex The recorded sex of the officer. "female"
officer_first_name* First name of the officer when provided. "MIGUEL"
officer_last_name* Last name of the officer when provided. "JEFFERSON"
officer_years_of_service Number of years officer has been with the police department. 22
officer_assignment Department or subdivision to which officer has been assigned. "8th District"
department_id ID of department or subdivision to which officer has been assigned. 90
department_name Name of department or subdivision to which officer has been assigned. 90
unit Unit to which officer has been assigned. "Patrol-1st"
type Type of stop: vehicular or pedestrian. "vehicular"
disposition Disposition of stop where provided. What is recorded here varies widely across police departments. "GUILTY"
violation Specific violation of stop where provided. What is recorded here varies widely across police departments. "SPEEDING 15-20 OVER"
arrest_made Indicates whether an arrest was made. FALSE
citation_issued Indicates whether a citation was issued. TRUE
warning_issued Indicates whether a warning was issued. TRUE
outcome The strictest action taken among arrest, citation, warning, and summons. "citation"
contraband_found Indicates whether contraband was found. When search_conducted is NA, this is coerced to NA under the assumption that contraband shouldn't be discovered when no search occurred, so a recorded value likely represents a data error. FALSE
contraband_drugs Indicates whether drugs were found. This is only defined when contraband_found is true. TRUE
contraband_weapons Indicates whether weapons were found. This is only defined when contraband_found is true. TRUE
contraband_other Indicates whether contraband other than drugs and weapons was found. This is only defined when contraband_found is true. TRUE
frisk_performed Indicates whether a frisk was performed. This is technically different from a search, but departments will sometimes include frisks as a search type. TRUE
search_conducted Indicates whether any type of search was conducted, i.e. driver, passenger, vehicle. Frisks are excluded where the department has provided resolution on both. TRUE
search_person Indicates whether a search of a person has occurred. This is only defined when search_conducted is TRUE. TRUE
search_vehicle Indicates whether a search of a vehicle has occurred. This is only defined when search_conducted is TRUE. TRUE
search_basis This provides the reason for the search where provided and is categorized into k9, plain view, consent, probable cause, and other. If a search occurred but the reason wasn't listed, we assume probable cause. "consent"
reason_for_arrest A freeform text field indicating the reason for arrest where provided. "outstanding warrant"
reason_for_frisk A freeform text field indicating the reason for frisk where provided. "suspicious movement"
reason_for_search A freeform text field indicating the reason for search where provided. "odor of marijuana"
reason_for_stop A freeform text field indicating the reason for the stop where provided. "EQUIPMENT MALFUNCTION"
speed The recorded speed of the vehicle for the stop. 76.2
posted_speed The speed limit where the stop was recorded. 55
use_of_force_description A freeform text field describing the use of force. "handcuffed"
use_of_force_reason A freeform text field describing the reason for the use of force. "weapons / violence related incident"
vehicle_color A freeform text of the vehicle color where provided; format varies widely. "BLK"
vehicle_make A freeform text of the vehicle make where provided; format varies widely. "TOYOTA"
vehicle_model A freeform text of the vehicle model where provided; format varies widely. "Cherokee"
vehicle_type A freeform text of the vehicle type where provided; format varies widely. "TRUCK"
vehicle_registration_state A freeform text of the vehicle registration state where provided; format varies widely. "CA"
vehicle_year Vehicle manufacture year where provided. This value is NA for any year before 1800. 2007
notes A freeform text field containing any officer notes. "NO PASSENGERS"
* Removed for public release for privacy reasons.
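
Two of the rules in the table above lend themselves to a short illustration: the age calculation with range coercion, and the removal of geocodes far from the median. The Python sketch below is only an approximation under stated assumptions: ages are computed as fractional years (days divided by 365.25), latitude and longitude are checked independently, and the population standard deviation is used. The function names are hypothetical and our processing code may differ in these details.

```python
from datetime import date
from statistics import median, pstdev


def subject_age(dob, stop_date, lo=10, hi=110):
    """Age in years at the stop date; None (NA) if missing or outside [lo, hi]."""
    if dob is None or stop_date is None:
        return None
    age = (stop_date - dob).days / 365.25  # assumption: fractional years
    return age if lo <= age <= hi else None


def drop_geocode_outliers(points, n_sd=4):
    """Set (lat, lng) points more than n_sd standard deviations from the median to None.

    Assumption: each coordinate is tested on its own, and a point is dropped
    if either coordinate is too far from that coordinate's median.
    """
    known = [p for p in points if p is not None]
    if len(known) < 2:
        return list(points)
    lats = [p[0] for p in known]
    lngs = [p[1] for p in known]
    lat_med, lng_med = median(lats), median(lngs)
    lat_sd, lng_sd = pstdev(lats), pstdev(lngs)

    def keep(p):
        return (abs(p[0] - lat_med) <= n_sd * lat_sd
                and abs(p[1] - lng_med) <= n_sd * lng_sd)

    return [p if p is not None and keep(p) else None for p in points]
```

Note that with very few points a single outlier inflates the standard deviation enough that it can never exceed the threshold, so a rule like this is only meaningful for locations with many geocoded stops.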

Best practices

We provide some lessons we’ve learned from working with these rich but complicated data.

  1. Read over the notes and processing code if you are going to focus on a particular location, so you’re aware of the judgment calls we made in processing the data. Taking a look at the original raw data is also wise (and may uncover additional fields of interest).
  2. Start with the cleaned data from a single small location to get a feel for the data. Rhode Island, Vermont, and Connecticut all load quickly.
  3. Note that loading and analyzing every state simultaneously takes significant time and computing resources. One way to get around this is to compute aggregate statistics from each state. For example, you can compute search rates for each age, gender, and race group in each state, save those rates, and then quickly load them to compute national-level statistics broken down by age, race, and gender.
  4. Take care when making direct comparisons between locations. For example, if one state has a far higher consent search rate than another state, that may reflect a difference in search recording policy across states, as opposed to an actual difference in consent search rates.
  5. Examine counts over time in each state: for example, total numbers of stops and searches by month or year. This will help you find years for which data is very sparse (which you may not want to include in analysis).
  6. Do not assume that all disparities are due to discrimination. For example, if young men are more likely to receive citations after being stopped for speeding, this might simply reflect the fact that they are driving faster.
  7. Do not assume the standardized data are absolutely clean. We discovered and corrected numerous errors in the original data, which were often very sparsely documented and changed from year to year, requiring us to make educated guesses. This messy nature of the original data makes it unlikely the cleaned data are perfectly correct.
  8. Do not read too much into very high stop, search, or other rates in locations with very small populations or numbers of stops. For example, if a county has only 100 stops of Hispanic drivers, estimates of search rates for Hispanic drivers will be very noisy and hit rates will be even noisier. Similarly, if a county with very few residents has a very large number of stops, it may be that the stops are not of county residents, making stop rate computations misleading.
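
The aggregation approach in item 3 can be sketched as follows. This is a minimal Python illustration, assuming each location's stops are available as an iterable of dicts keyed by the standardized column names (with None standing in for NA); the function names `search_counts` and `combine` are hypothetical.

```python
from collections import Counter


def search_counts(rows):
    """Count stops and searches per subject_race for a single location."""
    stops, searches = Counter(), Counter()
    for row in rows:
        if row["search_conducted"] is None:
            continue  # skip stops where the search status is unknown
        stops[row["subject_race"]] += 1
        searches[row["subject_race"]] += bool(row["search_conducted"])
    return stops, searches


def combine(per_location):
    """Sum the saved per-location counts, then compute pooled search rates."""
    total_stops, total_searches = Counter(), Counter()
    for stops, searches in per_location:
        total_stops.update(stops)
        total_searches.update(searches)
    return {race: total_searches[race] / n for race, n in total_stops.items()}
```

Saving counts rather than rates keeps the pooled figure weighted by the number of stops, and the retained denominators make it easy to spot the small-sample groups that item 8 warns about.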

The following contains date ranges, coverage rates, and some notes on each location. A coverage rate is 1 - null rate, so it represents the proportion of data that have values for that feature. The reported coverage rates are also predicated, which means that some columns' coverage is calculated only after filtering on another column. For instance, the coverage for contraband_found is reported after filtering to instances where search_conducted was true. In a similar fashion, search_basis and reason_for_search are only calculated when search_conducted is true, reason_for_arrest when arrest_made is true, and contraband_drugs, contraband_weapons, contraband_alcohol, and contraband_other when contraband_found is true.
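
As a rough illustration of how a predicated coverage rate is computed, the Python sketch below treats each stop as a dict with None standing in for NA. The function name `coverage` and the `predicate` parameter are hypothetical, introduced only for this example.

```python
def coverage(rows, column, predicate=None):
    """Proportion of non-NA values in `column`, optionally after filtering.

    If `predicate` is given, only rows for which it returns True are counted.
    """
    if predicate is not None:
        rows = [r for r in rows if predicate(r)]
    if not rows:
        return None  # coverage is undefined when no rows qualify
    return sum(r.get(column) is not None for r in rows) / len(rows)


stops = [
    {"search_conducted": True, "contraband_found": False},
    {"search_conducted": True, "contraband_found": None},
    {"search_conducted": False, "contraband_found": None},
]
# Coverage of contraband_found among stops where a search was conducted.
print(coverage(stops, "contraband_found", lambda r: r["search_conducted"] is True))  # 0.5
```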

The notes are not intended to be a comprehensive description of all the data features in every state, since this would be prohibitively lengthy. Rather, they are brief observations we made while processing the data. We hope they will be useful to others. They are worth reading prior to performing detailed analysis of a location.

Our analysis only scratches the surface of what’s possible with these data. We’re excited to see what you come up with!

Little Rock, AR

2017-01-01 to 2017-11-03

feature coverage rate
date 100.0%
time 100.0%
lat 2.4%
lng 2.4%
subject_age 99.8%
subject_race 100.0%
subject_sex 99.8%
officer_first_name 100.0%
officer_last_name 100.0%
type 100.0%
citation_issued 100.0%
outcome 100.0%
vehicle_type 100.0%
raw_defendant_race 100.0%

Data notes:

  • lat/lng data doesn't appear totally accurate; there are ~18k lat/lngs that were coerced to NA because they all equalled "-1.79769313486232E+308"
  • Data is deduplicated on date, time, lat, lng, race, sex, and officer name, reducing the number of records by ~30.6%
  • Data consists only of citations
  • raw_defendant_race represents Defendant Race in the raw data and is the column from which subject_race is derived
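
Deduplication on a set of columns, as described in these notes and in the notes for several other locations, can be sketched in Python as follows. The function name `deduplicate` is hypothetical, and keeping the first row of each group is an assumption made for the example.

```python
def deduplicate(rows, keys):
    """Keep the first row for each distinct combination of `keys`.

    Returns the kept rows and the fraction of records removed.
    """
    seen = set()
    kept = []
    for row in rows:
        signature = tuple(row.get(k) for k in keys)
        if signature not in seen:
            seen.add(signature)
            kept.append(row)
    removed = 1 - len(kept) / len(rows) if rows else 0.0
    return kept, removed
```

The returned fraction corresponds to the "reducing the number of records by" figures reported in the notes.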

Gilbert, AZ

2008-01-01 to 2018-05-23

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 99.9%
lng 99.9%
officer_id 100.0%
officer_id_hash 100.0%
officer_first_name 100.0%
officer_last_name 100.0%
type 100.0%
vehicle_color 0.1%
vehicle_make 0.1%
vehicle_model 0.1%
vehicle_year 0.0%

Data notes:

  • Data is deduplicated on call_id, reducing the number of records by 17.6%; this was equivalent to deduping on date, time, location, and officer_id; subject name appears to have been entered multiple times per call_id, and often in subtly different formats
  • Most important data is missing, including outcome (arrest, citation, warning), reason for stop, search, contraband, and demographic information on the subject (except name, which is redacted for privacy)
  • call_type was either TS (traffic stop) or SS (subject stop), which we translated to 'vehicular' or 'pedestrian' stops
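
The call_type translation in the last note amounts to a small lookup. A minimal Python sketch, with hypothetical names, might look like this; returning None (NA) for any other code is an assumption.

```python
# Raw call_type codes and the standardized stop type they translate to.
CALL_TYPE_TO_STOP_TYPE = {"TS": "vehicular", "SS": "pedestrian"}


def gilbert_stop_type(call_type):
    """Translate a raw call_type code; unknown or missing codes become None (NA)."""
    return CALL_TYPE_TO_STOP_TYPE.get(call_type)
```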

Mesa, AZ

2014-01-01 to 2017-03-31

feature coverage rate
date 100.0%
time 99.8%
location 99.4%
lat 98.5%
lng 98.5%
subject_age 98.5%
subject_race 100.0%
subject_sex 98.7%
officer_id 100.0%
officer_id_hash 100.0%
officer_last_name 100.0%
type 93.0%
violation 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 100.0%
raw_race_fixed 100.0%
raw_ethnicity_fixed 100.0%
raw_charge 100.0%

Data notes:

  • INCIDENT_NO appears to refer to the same incident but can involve multiple people, i.e. 20150240096, which appears to be an alcohol bust of several underage teenagers; in other instances, the rows look nearly identical, but given this information and checking several other seeming duplicates, it appears as though there is one row per person per incident
  • violation is charge_desc in the raw data, and raw_charge represents the charge code in the raw data
  • subject_race was derived from ethnicity_fixed and race_fixed in the raw data, provided in the clean data with raw_*

Statewide, AZ

2009-01-06 to 2017-12-31

feature coverage rate
date 99.9%
time 99.9%
location 99.9%
county_name 89.3%
subject_race 99.9%
subject_sex 99.9%
officer_id 99.9%
officer_id_hash 99.9%
type 100.0%
violation 33.2%
arrest_made 99.9%
citation_issued 99.9%
warning_issued 99.9%
outcome 89.3%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_other 6.0%
search_conducted 100.0%
search_person 98.5%
search_vehicle 100.0%
search_basis 100.0%
reason_for_stop 69.3%
vehicle_type 97.2%
vehicle_year 98.4%
raw_Ethnicity 99.9%
raw_OutcomeOfStop 99.9%
raw_ReasonForStop 99.9%
raw_TypeOfSearch 3.3%
raw_ViolationsObserved 56.5%

Data notes:

  • Counties were mapped in two ways. First, we determined which counties the codes in the County field referred to by using the highways which appeared most frequently in each coded county. Second, for stops which had no data in the County field, we used the values in the Highway and Milepost fields to estimate where the stop took place. For this, we relied on highway marker maps (sources: here and here) to map the most frequently traversed highways, which covered the vast majority of stops. Using these two methods, we were able to map 95% of stops which had any location data (i.e., values in either County or Highway and Milepost), and 89% of stops overall.
  • It would be possible to map the highway and mile marker data to geo coordinates, like we did in Washington.
  • There is a two-week period in October 2012 and a two-week period in November 2013 when no stops are recorded. We also are missing December 2015. Dates are sparse in 2009–2010 (and even up until mid-2011).
  • We also received a file with partial data on traffic stops pre-2009; this is not included in the dataset.
  • Data for violation reason is largely missing.
  • Raw column VehicleSearchAuthority and DriverSearchAuthority seem to provide search basis but we lack a mapping for the codes. ConsentSearchAccepted gives us information on search type for a small fraction of searches.
  • raw_TypeOfSearch includes information on who was searched (e.g., driver vs. passenger), but does not provide information on the type of search (e.g., probable cause vs. consent).
  • Some contraband information is available and so we define a contraband_found column in case it is useful to other researchers. But the data is messy and there are multiple ways contraband_found might be defined, and so we do not include Arizona in our contraband analysis.
  • Additional raw data columns that may be of interest: ConsentSearchRequested (note that there is also a raw column ConsentSearchAccepted -- which populates the clean values search_basis == "consent"), IfConsentRequestGranted, (FS, RS, NA), SubjectDemeanor (CO, UN, CM, NA), StopDuration (A-F, NA), DistractedDriving (1-2 word free field), ImmigrationStatusCheck (boolean, nearly all NA), VehicleImpounded (Y, N, I, NA), ImpoundReason (LI, NL, CN, DE, DM, II, UA, NA), TypeOfContact (D, P, E, N, C, NA), DrugSeizureType (combinations of P, S, T), DUIBAC, DUICharges, DUITests (combinations of B, I, U), PreStopIndicator (VT = "Vehicle Type, Condition or Modification", BL = "Driver Body Language", PB = "Passenger Behavior", DB = "Driving Behavior", OT = "Other", NO = "None")

Anaheim, CA

2012-01-01 to 2017-03-14

feature coverage rate
date 100.0%
type 100.0%
reason_for_stop 100.0%

Data notes:

  • Very little information received, only a reference number, date, year, case type (with no translation), and a case type (with no translation)
  • reason_for_stop is Final Case Type D in the raw data

Bakersfield, CA

2008-03-09 to 2018-03-09

feature coverage rate
date 100.0%
time 99.5%
location 99.9%
lat 98.6%
lng 98.6%
beat 91.3%
subject_age 99.5%
subject_dob 99.4%
subject_race 99.6%
subject_sex 99.6%
officer_id 100.0%
officer_id_hash 100.0%
type 100.0%
citation_issued 100.0%
outcome 100.0%
raw_ethnicity 95.8%
raw_statute_name 100.0%
raw_statute_section 100.0%
raw_race 99.6%

Data notes:

  • Data is deduplicated on raw columns date_of_birth, subject_address, ethnicity, gender_code, occ_date, occ_time, reducing the number of records by ~1.2%
  • Data does not include reason for stop, search, contraband fields
  • Missing data dictionaries for ticket classes, ticket statuses, and statute section
  • subject_race is based on ethnicity and race; the raw columns are provided in the clean data
  • We currently have no data dictionaries for statute_section and statute_name, but they are passed through to the clean data
  • Data consists only of citations

Oakland, CA

2013-04-01 to 2017-12-31

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 99.9%
lng 99.9%
beat 45.7%
subject_age 23.0%
subject_race 100.0%
subject_sex 99.9%
officer_assignment 9.0%
type 85.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 74.4%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
search_conducted 100.0%
search_basis 100.0%
reason_for_stop 100.0%
use_of_force_description 12.5%
raw_subject_sdrace 100.0%
raw_subject_resultofencounter 100.0%
raw_subject_searchconducted 100.0%
raw_subject_typeofsearch 60.9%
raw_subject_resultofsearch 16.3%

Data notes:

  • Data is deduplicated on raw columns contactdate, contacttime, streetname, subject_sdrace, subject_sex, and subject_age, reducing the number of records by ~5.2%
  • Stops from 2013-2015 don't have encountertype like 2016-2017, so we attempt to pull it out from ReasonForEncounter; however, this breakdown is imprecise, because while one category is "Traffic Violation", another is "Probable Cause"; presumably, "Probable Cause" could be a reason for a vehicular stop; so, the stop is type vehicular if the encountertype was vehicular or the reason for encounter involved a traffic violation; it was classified as pedestrian if the encountertype was pedestrian or bicycle, otherwise this field is NA, since we can't say whether "Probable Cause" or "Reasonable Suspicion" was a vehicular or pedestrian stop
  • Contraband is encoded based on ResultOfSearch (pedestrian) and subject_resultofsearch (vehicular); None, NA, and anything with "Returned" after it are excluded, i.e. "Marijuana - Returned", "Other Weapons - Returned," under the assumption that returned items were not contraband
  • 2013 is missing the first 3 months of data and 2015 is missing the last 3 months of data
  • Some of the raw columns were named similarly but not exactly the same across years, i.e. ResultOfSearch in 2013, 2014, and 2015, but subject_resultofsearch in 2016 and 2017; these were renamed to be consistent with the latter years in the raw data loading function
  • subject_{resultofencounter,typeofsearch,search_conducted,resultofsearch} formed the foundation for search and contraband fields and are passed through in the clean data
  • subject_race is derived from subject_sdrace, which is passed through to the clean data
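
The stop type and contraband rules described in these notes can be sketched in Python as below. The exact raw strings are assumptions made for illustration ("vehicle", "pedestrian", and "bicycle" for encountertype, and a reason containing "traffic violation"), as are the function names; the raw data may spell these values differently.

```python
def oakland_stop_type(encounter_type, reason_for_encounter):
    """Classify a stop as "vehicular", "pedestrian", or None (NA)."""
    encounter = (encounter_type or "").strip().lower()
    reason = (reason_for_encounter or "").strip().lower()
    if encounter == "vehicle" or "traffic violation" in reason:
        return "vehicular"
    if encounter in ("pedestrian", "bicycle"):
        return "pedestrian"
    # e.g. "Probable Cause" or "Reasonable Suspicion" with no encounter type
    return None


def oakland_contraband_found(result_of_search):
    """False for missing, "None", "NA", or "... - Returned" results; True otherwise."""
    if result_of_search is None:
        return False
    text = result_of_search.strip().lower()
    if text in ("", "none", "na"):
        return False
    return not text.endswith("returned")
```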

San Bernardino, CA

2011-12-13 to 2017-09-19

feature coverage rate
date 100.0%
time 100.0%
location 98.6%
lat 93.0%
lng 93.0%
type 72.0%
disposition 99.7%
arrest_made 99.7%
citation_issued 99.7%
outcome 37.8%
raw_CallType 100.0%

Data notes:

  • Data is deduplicated on raw columns CreateDateTime, Address, and CallType, removing ~26.3% of records
  • Data does not include most useful information, including demographic, outcome, and search/contraband information, so the deduplication above potentially over-deduplicates
  • type is derived from CallType, which is passed through since we lack a data dictionary for some of them

Long Beach, CA

2008-01-01 to 2017-12-31

feature coverage rate
date 100.0%
location 100.0%
lat 74.4%
lng 74.4%
beat 66.2%
district 66.2%
subdistrict 66.2%
division 66.2%
subject_age 94.5%
subject_race 100.0%
subject_sex 99.9%
officer_id 100.0%
officer_id_hash 100.0%
officer_age 100.0%
officer_race 100.0%
officer_sex 100.0%
officer_years_of_service 100.0%
type 92.2%
violation 99.7%
citation_issued 100.0%
outcome 100.0%
vehicle_make 85.6%
vehicle_registration_state 83.9%
vehicle_year 81.2%
raw_race 100.0%
raw_sex 100.0%
raw_officer_race 100.0%

Data notes:

  • Data is deduplicated on raw columns Date, Location, Race, Sex, and Officer DID, reducing the number of records by ~14.3%
  • Data does not include reason for stop, search, or contraband fields
  • violation is a concatenation of 4 violation descriptions, separated by ';'
  • type is derived from violation_1_description
  • raw columns sex, race, and officer_race are passed through since our translations may simplify them
  • There is a notable drop in stops from 2008 to 2016; the cause is unclear
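
For reference, building (and later splitting) a violation value from several description fields could look like the Python sketch below. Skipping missing or empty descriptions is an assumption, and the function names are hypothetical.

```python
def join_violations(descriptions, sep=";"):
    """Concatenate the non-missing violation descriptions with `sep`."""
    parts = [d.strip() for d in descriptions if d is not None and d.strip()]
    return sep.join(parts) if parts else None


def split_violations(violation, sep=";"):
    """Recover the individual descriptions from a concatenated value."""
    return [] if violation is None else violation.split(sep)
```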

Los Angeles, CA

2010-01-01 to 2018-06-23

feature coverage rate
date 100.0%
time 100.0%
district 100.0%
region 100.0%
subject_race 100.0%
subject_sex 100.0%
officer_id 100.0%
officer_id_hash 100.0%
type 100.0%
raw_descent_description 100.0%

Data notes:

  • Data is deduplicated on raw columns stop_date, stop_time, reporting_district, division_description_1, division_description_2, officer_1_serial_number, officer_2_serial_number, descent_description, sex_code, and stop_type, reducing the number of records by ~17.7%
  • Search/contraband, outcome, and location data are missing
  • subject_race is derived from descent_description, which is passed through

San Diego, CA

2014-01-01 to 2017-03-31

feature coverage rate
date 100.0%
time 99.8%
service_area 100.0%
subject_age 96.9%
subject_race 99.7%
subject_sex 99.8%
type 100.0%
arrest_made 90.9%
citation_issued 91.6%
warning_issued 91.6%
outcome 89.8%
contraband_found 100.0%
search_conducted 100.0%
search_person 99.4%
search_vehicle 99.4%
search_basis 100.0%
reason_for_search 87.7%
reason_for_stop 99.9%
raw_action_taken 91.6%
raw_subject_race_description 99.7%

Data notes:

  • stop_id in raw data doesn't appear to apply to unique events, as the same id has different service_area, subject_race, subject_age, and subject_sex, e.g. 1099162
  • Data is deduplicated on raw columns timestamp, subject_race, subject_sex, subject_age, and service_area, reducing the number of records by ~2.0%
  • There are no locations, but service_area is provided
  • subject_race is derived from subject_race_description which is passed through
  • reason_for_search is named SearchBasis in the raw data and search_basis is derived from this column
  • outcomes are based on ActionTaken, which is passed through as raw_action_taken
  • search_conducted is named searched in the raw data; when searched is NA, this is interpreted as FALSE for search_conducted under the assumption that officers sometimes don't record the absence of a search. Furthermore, where searched is NA, SearchBasis, SearchBasisOther, and SearchType are all NA, as well, suggesting that no search occurred
  • If search_conducted was true but contraband_found was NA, it was changed to false, under the assumption that NA means false when a search is performed
  • 2017 only has data for part of the year
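
The NA handling for search_conducted and contraband_found described in these notes can be summarized in a few lines of Python. This is a sketch with a hypothetical function name, assuming the raw values have already been parsed to True, False, or None (NA).

```python
def clean_search_fields(searched, contraband):
    """Return (search_conducted, contraband_found) for one stop."""
    # NA in the raw `searched` column is interpreted as no search.
    search_conducted = bool(searched) if searched is not None else False
    if search_conducted and contraband is None:
        # NA is taken to mean "nothing found" when a search was performed.
        contraband_found = False
    else:
        contraband_found = contraband
    return search_conducted, contraband_found
```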

San Francisco, CA

2007-01-01 to 2016-06-30

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 99.8%
lng 99.8%
district 94.2%
subject_age 93.5%
subject_race 100.0%
subject_sex 100.0%
type 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 98.3%
contraband_found 100.0%
search_conducted 100.0%
search_vehicle 100.0%
search_basis 100.0%
reason_for_stop 99.8%
raw_search_vehicle_description 100.0%
raw_result_of_contact_description 100.0%

Data notes:

  • Search basis in the raw data is only "No Search", consent, or other (inventory, incident to arrest, and parole searches)
  • Contraband found is derived from search_vehicle_description, which, unfortunately, only has the search basis and "Positive Result" or "Negative Result", the former indicating when contraband found is true; it is passed through to the clean data
  • outcomes are based on result_of_contact_description, which is passed through
  • Data is deduplicated on raw columns date, time, race_description, sex, age, location, removing ~0.3% of stops
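
A minimal Python sketch of deriving contraband_found from search_vehicle_description follows. The function name is hypothetical, and matching on the phrases "Positive Result" and "Negative Result" anywhere in the string is an assumption about the raw format.

```python
def contraband_from_description(search_vehicle_description):
    """Derive contraband_found from a raw search description string.

    Returns True for "Positive Result", False for "Negative Result", and
    None (NA) when neither phrase is present, e.g. for "No Search".
    """
    if search_vehicle_description is None:
        return None
    text = search_vehicle_description.lower()
    if "positive result" in text:
        return True
    if "negative result" in text:
        return False
    return None
```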

San Jose, CA

2013-09-01 to 2018-03-31

feature coverage rate
date 100.0%
time 100.0%
location 92.5%
lat 88.9%
lng 88.9%
subject_race 96.4%
type 88.7%
arrest_made 99.8%
citation_issued 99.8%
outcome 38.0%
contraband_found 100.0%
search_conducted 97.7%
reason_for_stop 94.8%
use_of_force_description 88.6%
use_of_force_reason 92.2%
raw_search 96.4%
raw_call_desc 100.0%
raw_race 96.4%
raw_event_desc 99.8%

Data notes:

  • event_number in raw data has indeterminate meaning, several event numbers occur at the same time but have up to 16 duplicates; however, some of these involve different subjects, so it's unclear whether they are distinct incidents or large incidents involving many people
  • Data is deduplicated using date, time, location, subject race, and raw_search (SEARCH in raw data); this removes about 4.4% of records, but many of these rows are lacking sufficient information for differentiation, i.e. they have NA for many of their values
  • search_conducted is derived from SEARCH (raw_search in clean data); NAs in the original column are converted to FALSE, under the assumption that officers sometimes don't record the absence of a search. However, there are values other than the ones provided in the data dictionary (and that are not NA); these are converted to NA for search_conducted but remain available in the raw_search column for review; some of them appear to be the result of malformed rows and/or incorrect data entry, i.e. some of them could be race classifications
  • type is based on TYCOD DESCRIPTION, which is passed through as raw_call_desc
  • race is passed through to provide access to greater granularity
  • a translation of EVENT DISPO is provided as raw_event_desc; this was used for outcomes
  • 2013 and 2018 only have partial data

Santa Ana, CA

2014-06-11 to 2018-04-13

feature coverage rate
date 100.0%
location 100.0%
lat 99.9%
lng 99.9%
district 96.1%
region 96.6%
subject_race 99.8%
subject_sex 100.0%
officer_id 100.0%
officer_id_hash 100.0%
type 99.9%
violation 100.0%
citation_issued 100.0%
outcome 100.0%
raw_race 99.8%

Data notes:

  • Deduping on raw columns Date, Race, Sex, Violation Description, Officer (Badge), and Primary Street would reduce this dataset by ~9.7%, but there is insufficient information to justify this without the incident time. For instance, the highest frequency "incident" deduping on that criteria was 16 male Hispanic drivers failing to stop at a stop sign by the same officer on 5th Street; while this could be 16 duplicates, it could also be the same officer pulling over 16 people throughout that day
  • Data does not include search or contraband information
  • Data includes only citations
  • 2014 and 2018 only contain partial data
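
To judge a candidate deduplication key like the one above before applying it, it can help to look at the largest groups it produces and at the share of rows it would drop. The following is a sketch only; it assumes the raw rows have been loaded as dictionaries (for example with `csv.DictReader`) keyed by the raw column names, and the helper names are hypothetical.

```python
from collections import Counter

# Candidate key taken from the note above.
KEY = ("Date", "Race", "Sex", "Violation Description", "Officer (Badge)", "Primary Street")


def largest_groups(rows, key=KEY, top=5):
    """Return the most frequent key combinations with their row counts."""
    counts = Counter(tuple(row.get(col) for col in key) for row in rows)
    return counts.most_common(top)


def share_removed(rows, key=KEY):
    """Fraction of rows that deduplicating on `key` would drop."""
    if not rows:
        return 0.0
    distinct = {tuple(row.get(col) for col in key) for row in rows}
    return 1 - len(distinct) / len(rows)
```

Inspecting the top groups by hand is what reveals cases such as the 16-row group on 5th Street, where the key alone cannot distinguish duplicates from a busy day.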

Statewide, CA

2009-07-01 to 2016-06-30

feature coverage rate
date 100.0%
county_name 99.7%
district 99.7%
subject_race 100.0%
subject_sex 100.0%
department_name 100.0%
type 100.0%
violation 100.0%
arrest_made 69.8%
citation_issued 69.8%
warning_issued 69.8%
outcome 69.8%
contraband_found 4.3%
frisk_performed 0.2%
search_conducted 100.0%
search_person 96.8%
search_basis 100.0%
reason_for_stop 100.0%
raw_race 100.0%
raw_search_basis 100.0%

Data notes:

  • CHP districts roughly map to counties, so we mapped stops to counties using the map of CHP districts, which is included in the raw data. Some counties appear to have very high stop rates; this is because they have very small populations. It seems likely that the people stopped in those counties are mostly not residents.
  • Driver age categories are included in the raw data; these cannot be mapped to granular values, so we cannot fill out the driver_age field.
  • Driver race was recorded with high granularity. Raw mapping:
    • A = Other Asian
    • B = Black
    • C = Chinese
    • D = Cambodian
    • F = Filipino
    • G = Guamanian
    • H = Hispanic
    • I = Indian
    • J = Japanese
    • K = Korean
    • L = Laotian
    • O = Other
    • P = Other Pacific Islander
    • S = Samoan
    • U = Hawaiian
    • V = Vietnamese
    • W = White
    • Z = Asian Indian
    subject_race is mapped from raw_race above.
  • Search basis was recorded more finely in raw data. Raw mapping:
    • 1 = Probable Cause (positive)
    • 2 = Probable Cause (negative)
    • 3 = Consent (positive), 202D Required
    • 4 = Consent (negative), 202D Required
    • 5 = Incidental to Arrest
    • 6 = Vehicle Inventory
    • 7 = Parole / Probation / Warrant
    • 8 = Other
    • 9 = Pat Down / Frisk
    search_basis is mapped from raw_search_basis above.
  • Very few consent searches are conducted relative to other states.
  • Contraband found information is only available for a small subset of searches: the raw data can tell you if a probable cause search or a consent search yielded contraband, but cannot tell you if contraband was located during a search conducted incident to arrest. (Note that in many cases we cast NA contraband to F, but in this case we do not, because we simply do not have contraband recovery data for non-discretionary searches). We still include California in our contraband analysis because we exclude non-discretionary searches like those incident to arrest.
  • Shift time is included in the raw data, but is not sufficiently granular to yield a reliable stop time.
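
The contraband note above can be illustrated with the raw search-basis codes. This sketch assumes that "(positive)" and "(negative)" in codes 1 to 4 mean contraband was or was not found, which is our reading of the code list rather than something the raw data dictionary states here; the function names are hypothetical.

```python
# Raw search-basis codes as listed above.
SEARCH_BASIS = {
    1: "Probable Cause (positive)",
    2: "Probable Cause (negative)",
    3: "Consent (positive), 202D Required",
    4: "Consent (negative), 202D Required",
    5: "Incidental to Arrest",
    6: "Vehicle Inventory",
    7: "Parole / Probation / Warrant",
    8: "Other",
    9: "Pat Down / Frisk",
}


def is_discretionary(code):
    """Probable cause and consent searches are the only ones with outcome data."""
    return code in (1, 2, 3, 4)


def contraband_found(code):
    """True/False for probable cause and consent searches, None (NA) otherwise."""
    if code in (1, 3):  # assumed meaning of "(positive)"
        return True
    if code in (2, 4):  # assumed meaning of "(negative)"
        return False
    return None  # no recovery data, so NA is not cast to False
```

Returning `None` rather than `False` for codes 5 to 9 mirrors the decision above not to cast NA contraband to F for this state.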

Stockton, CA

2012-01-01 to 2016-12-31

feature coverage rate
date 100.0%
division 99.6%
subject_age 99.4%
subject_race 99.5%
subject_sex 99.6%
officer_id 54.9%
officer_id_hash 54.9%
type 100.0%
arrest_made 99.7%
citation_issued 99.7%
warning_issued 99.7%
outcome 99.5%
search_conducted 100.0%
search_basis 100.0%
reason_for_stop 99.6%
raw_result 99.7%
raw_search 99.7%

Data notes:

  • Data consists of two sets of files, traffic stop surveys and CAD stop files, but currently there is no information on how to join them; location is in the stop files, but all other demographic information is in the traffic stop survey files
  • There may be duplicates, but it's unclear how to identify them: date, age, gender, and race are the only consistently filled-in fields, and the maximum number of stops for any date, age, gender, race combination is 10, which could occasionally be a reasonable number of stops for that combination over the course of a day across the entire city
  • officer_id is coalesced from officer_id and officer_id2, the former being 90% null and the latter 50% null in the dataset
  • Outcomes are based on raw column result, which is passed through
  • search_conducted and search_basis are derived from the raw column search, which is passed through; where SEARCH was NA, search_conducted is set to false, under the assumption that sometimes officers don't record the absence of a search
  • 2012 has suspiciously little data

Aurora, CO

2012-01-01 to 2016-12-31

feature coverage rate
date 100.0%
time 99.5%
location 100.0%
lat 81.8%
lng 81.8%
district 80.6%
subject_age 96.6%
subject_dob 96.5%
subject_race 100.0%
subject_sex 98.9%
type 97.5%
violation 98.0%
citation_issued 100.0%
outcome 100.0%
raw_ethnicity 12.9%
raw_race 100.0%

Data notes:

  • Data is deduplicated on raw columns Ticket Date, Ticket Time, Ticket Location, First Name, Last Name, sex, and Date of Birth, reducing the number of records by ~1.0%
  • subject_race was based on Race and Ethnicity in the raw data, which are passed through

Denver, CO

2010-12-31 to 2018-07-19

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 100.0%
lng 100.0%
district 100.0%
precinct 100.0%
type 100.0%
disposition 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 54.8%

Data notes:

  • MASTER_INCIDENT_NUMBER has many duplicates, but it's unclear what it corresponds to or how to deduplicate it if that is the correct thing to do, since the records are nearly identical except for the NEIGHBORHOOD_NAME
  • Data does not contain subject demographic or search/contraband information

Statewide, CO

2010-01-01 to 2017-12-31

feature coverage rate
date 100.0%
time 0.0%
location 100.0%
county_name 100.0%
subject_age 72.4%
subject_dob 70.3%
subject_race 87.1%
subject_sex 71.2%
officer_id 83.0%
officer_id_hash 83.0%
officer_sex 31.8%
officer_first_name 82.9%
officer_last_name 82.9%
type 100.0%
violation 83.9%
arrest_made 54.1%
citation_issued 54.1%
warning_issued 54.1%
outcome 41.6%
contraband_found 100.0%
search_conducted 92.6%
search_basis 98.0%
raw_Ethnicity 87.8%

Data notes:

  • The state did not provide us with mappings for every police department code to police department name.
  • Arrest and citation data are unreliable from 2014 onward. Arrest rates drop essentially to zero.
  • Counties were mapped using a dictionary provided by the agency. Denver County has many fewer stops than expected given the residential population; this is because it only contains a small section of highway which is policed by the state patrol.
  • Rows in raw data represent violations, not stops, so we remove duplicates by grouping by the other fields.
  • subject_race was mapped from raw_Ethnicity.
  • Note that data from 2016 came with about 80 fewer columns than the data pre-2016 and after 2016, so many values for that year will be NA, including search data (see below for details).
  • The data came in three files, the first covered 2010-March 2016; this has full data. The second covered Jan-Dec 2016; this was missing many columns, including whether a search was conducted. The third data file covered Jan-Dec 2017 and had full data. In order to preserve as much search data as possible we use the second file with missing data only to fill in the nine months of April-Dec 2016. This, in particular, affects the marijuana analysis search rate time series.
  • Additional columns in the raw data that may be of interest: MMJCard, DUIDType, NonUS, NonUSDL, NonUSDLLocation, DLCheck, TrafficAccident, AccidentSeverity (0-4), DUIArrest, HVPTCitation, SeatBeltCitation, FelonyArrest, Misdemeanors, Felonies (count), VehicleInspected, Recoveries, TrafficOral, AssistOral, AllOtherOral, GrantCategory, GrantLabel, Assists, AssistsMultiple, AssistsCount, ContrabandCharge (petty offense, felony, misdemeanor, none, traffic), Warrant, MisdemeanorOrFelony (M or F)
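
The way the three files are combined, as described above, can be sketched like this. It assumes each row is a dictionary with a `date` key holding a `datetime.date`; the function name and the defensive filter on the first file are ours, not the project's.

```python
from datetime import date

GAP_START = date(2016, 4, 1)
GAP_END = date(2016, 12, 31)


def stitch(file1_rows, file2_rows, file3_rows):
    """Combine the three deliveries, using file 2 only for April-Dec 2016."""
    # File 1 has full columns through March 2016.
    first = [r for r in file1_rows if r["date"] < GAP_START]
    # File 2 lacks many columns, so keep only the months nothing else covers.
    second = [r for r in file2_rows if GAP_START <= r["date"] <= GAP_END]
    # File 3 covers 2017 with full columns.
    return first + second + list(file3_rows)
```

Rows taken from the second file will have NA for the missing columns, which is why search fields are absent for April through December 2016.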

Statewide, CT

2013-10-01 to 2015-10-01

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 21.5%
lng 21.5%
county_name 100.0%
subject_age 99.7%
subject_race 100.0%
subject_sex 100.0%
officer_id 89.0%
officer_id_hash 89.0%
department_name 100.0%
type 98.2%
violation 99.9%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 98.4%
contraband_found 100.0%
search_conducted 100.0%
search_vehicle 100.0%
search_basis 95.0%
reason_for_stop 50.4%
raw_SubjectRaceCode 100.0%
raw_SubjectEthnicityCode 100.0%
raw_SearchAuthorizationCode 100.0%

Data notes:

  • Counties were mapped by running the cities in the Intervention Location Name field through Google's geocoder.
  • Rows appear to represent violations, not individual stops, because a small proportion of rows (1%) report the same officer making multiple stops at the same location at the same time. We grouped the data to combine these duplicates. We don't want to be overly aggressive in grouping together stops, so we only group if the other fields are the same.
  • While there is some search type data, a high fraction of searches are marked as "Other".
  • While there is some violation data, too much is missing.
  • Race (raw_SubjectRaceCode, raw_SubjectEthnicityCode) mapping:
    • A = Asian/Pacific Islander
    • B = Black
    • H = Hispanic
    • W = White
    • I = Native American
  • Search basis (raw_SearchAuthorizationCode) mapping:
    • C = consent
    • O = probable cause
    • I = inventory
  • The Connecticut state patrol created another website (link), where new data will get uploaded going forward. We haven't processed this yet.
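
The conservative grouping described above, where rows are combined only if every other field matches, might look like the following sketch. The column name `violation` and the `|` separator are assumptions for illustration, and rows are assumed to be dictionaries with hashable values.

```python
def collapse_violations(rows, violation_col="violation", sep="|"):
    """Merge rows that are identical in every column except the violation."""
    stops = {}  # insertion-ordered: the first occurrence fixes a stop's position
    for row in rows:
        # The grouping key is every field other than the violation itself.
        key = tuple(sorted((k, v) for k, v in row.items() if k != violation_col))
        if key not in stops:
            stop = dict(row)
            stop[violation_col] = []
            stops[key] = stop
        violation = row.get(violation_col)
        if violation and violation not in stops[key][violation_col]:
            stops[key][violation_col].append(violation)
    return [dict(s, **{violation_col: sep.join(s[violation_col])}) for s in stops.values()]
```

Because the key includes every non-violation field, two rows that differ in anything else (a different officer, for example) stay separate stops.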

Hartford, CT

2013-10-13 to 2016-09-29

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 98.9%
lng 98.9%
district 93.3%
subject_age 100.0%
subject_race 100.0%
subject_sex 100.0%
officer_id 100.0%
officer_id_hash 100.0%
department_name 100.0%
type 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 86.3%
contraband_found 99.9%
search_conducted 100.0%
search_vehicle 100.0%
search_basis 99.9%
reason_for_stop 100.0%
raw_subject_race_code 100.0%
raw_subject_ethnicity_code 100.0%
raw_search_authorization_code 100.0%
raw_intervention_disposition_code 100.0%

Data notes:

  • Data is deduplicated on raw columns InterventionDateTime, ReportingOfficerIdentificationID, InterventionLocationDescriptionText, SubjectRaceCode, SubjectSexCode, and SubjectAge, reducing the number of rows by ~1.1%
  • search rate is suspiciously high, ~28%
  • hit rate is suspiciously low, ~1%; we exclude Hartford from outcome and threshold tests because contraband recovered is so suspiciously low that we don't trust it, plus it's so low that it's not even enough data to run the statistical tests reliably.
  • subject_race is based on SubjectEthnicityCode and SubjectRaceCode, which are passed through as raw_subject_ethnicity_code and raw_subject_race_code
  • search_conducted and search_basis are derived from SearchAuthorizationCode, which is passed through as raw_search_authorization_code
  • outcomes are based on InterventionDispositionCode, which is passed through as raw_intervention_disposition_code
  • 2013 and 2016 have only partial data
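
For reference, the search rate and hit rate quoted above can be computed from the cleaned columns as sketched below. This assumes the usual definitions (searches divided by stops, and searches that found contraband divided by searches) and rows loaded as dictionaries with boolean `search_conducted` and `contraband_found` values.

```python
def search_and_hit_rate(stops):
    """Return (search rate, hit rate); each is None when its denominator is zero."""
    searches = [s for s in stops if s.get("search_conducted") is True]
    hits = [s for s in searches if s.get("contraband_found") is True]
    search_rate = len(searches) / len(stops) if stops else None
    hit_rate = len(hits) / len(searches) if searches else None
    return search_rate, hit_rate
```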

Tampa, FL

1973-06-21 to 2018-03-07

feature coverage rate
date 100.0%
subject_age 99.8%
subject_dob 99.7%
subject_race 100.0%
subject_sex 100.0%
officer_first_name 93.3%
officer_last_name 93.3%
department_name 100.0%
type 100.0%
violation 100.0%
citation_issued 100.0%
outcome 100.0%
vehicle_registration_state 97.6%
raw_race 100.0%

Data notes:

  • Data is deduplicated on date, subject_race, subject_dob, officer_last_name, officer_first_name, and Driver License Number, reducing the number of rows by ~13.2%; it's possible this slightly over-deduplicates, if an officer pulls over the same person in the same day
  • Data is missing search and contraband information, as well as outcomes other than citations
  • Hispanic race data is likely underreported, given that ACS 2017 5-year estimates suggest Hispanic individuals make up ~25% of the population, but only ~4% of stops in Tampa
  • The data sources are public (it's unclear what the difference is between the stop types):
  • subject_race is based on Race which is passed through as raw_race

Saint Petersburg, FL

2010-01-01 to 2010-07-29

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 89.7%
lng 89.7%
district 98.4%
officer_id 100.0%
officer_id_hash 100.0%
type 100.0%

Data notes:

  • Only 7 months of data provided
  • No demographic, search/contraband, or outcome data

Statewide, FL

2010-01-01 to 2018-12-31

feature coverage rate
date 100.0%
time 100.0%
location 16.9%
county_name 100.0%
subject_age 56.1%
subject_race 100.0%
subject_sex 56.3%
officer_id 100.0%
officer_id_hash 100.0%
officer_age 60.7%
officer_race 57.5%
officer_sex 64.1%
officer_last_name 56.3%
officer_years_of_service 68.5%
department_name 56.3%
unit 74.2%
type 100.0%
violation 94.2%
arrest_made 94.2%
citation_issued 98.0%
warning_issued 95.7%
outcome 76.4%
frisk_performed 94.2%
search_conducted 68.5%
search_basis 100.0%
reason_for_search 100.0%
reason_for_stop 94.2%
vehicle_registration_state 56.0%
notes 69.6%
raw_row_number_old 56.3%
raw_Race 100.0%
raw_Ethnicity 11.6%
raw_row_number_new 68.5%
raw_SearchType 94.2%
raw_EnforcementAction 94.2%

Data notes:

  • The raw data is very messy. Two different data sets were supplied, both with slightly different schemas, just for 2010 to part of 2016. A third dataset was supplied for 2016 through 2018. However, they were joined by uniquely identifying features. The second data dump goes until 2016, while the first only goes until 2015. The fields missing in the second or third data sets are thus missing for some rows.
  • There are many duplicates in the raw data, which we remove in two stages. First, we remove identical duplicate rows. Second, we group together rows which correspond to the same stop but to different violations or passengers.
  • The original data has a few parsing errors, but they don't seem important as they are spurious new lines in the last 'Comments' field.
  • The Florida PD clarified to us that both UCC Issued and DVER Issued in the raw_EnforcementAction column indicated citations, and we consequently coded them as such.
  • subject_race was mapped from raw_Ethnicity and raw_Race (the different data sets have different practices in terms of recording Hispanic in race vs ethnicity fields).
  • raw_SearchType was used to conclude search_conducted and search_basis.
  • While there is some data on whether items were seized, it is not clear if these are generally seized as a result of a search, and we thus do not define a contraband_found column for consistency with other states.
  • raw_EnforcementAction and notes were used to determine outcome.
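
The citation rule mentioned above (both UCC Issued and DVER Issued count as citations) can be sketched as a simple text match. Whether the raw field holds one action or several per row is not stated here, so the substring match below is an assumption, and this sketch ignores the notes column that is also used to determine outcome.

```python
# Enforcement actions that the department confirmed are citations.
CITATION_ACTIONS = ("UCC ISSUED", "DVER ISSUED")


def citation_issued(raw_enforcement_action):
    """True if the raw action mentions a citation, None when the field is NA."""
    if raw_enforcement_action is None:
        return None
    text = raw_enforcement_action.upper()
    return any(action in text for action in CITATION_ACTIONS)
```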

Statewide, GA

2012-01-01 to 2016-12-31

feature coverage rate
date 100.0%
time 100.0%
location 34.9%
lat 99.1%
lng 99.1%
county_name 100.0%
subject_race 52.9%
subject_sex 96.1%
officer_id 100.0%
officer_id_hash 100.0%
department_name 100.0%
type 100.0%
violation 100.0%
outcome 100.0%
vehicle_color 98.1%
vehicle_make 99.0%
vehicle_model 96.0%
vehicle_year 95.2%
raw_race 52.9%

Data notes:

  • The data represent warnings.
  • The provided .txt was comma-separated, but not quoted. Therefore we had to write a script (convert_GA.py) to iron out some obviously misaligned columns.
  • Rows represent individual warnings, and thus need to be aggregated to represent a single stop.
  • The race field on the warnings form is optional; we have only about 50% race coverage, so GA is omitted from all analyses.
  • subject_race was mapped from raw_race.

Statewide, IA

2006-01-01 to 2016-04-25

feature coverage rate
date 100.0%
time 84.7%
location 89.2%
county_name 4.6%
subject_age 39.3%
subject_race 26.0%
subject_sex 39.2%
officer_id 57.8%
officer_id_hash 57.8%
department_name 57.8%
type 85.2%
violation 92.5%
citation_issued 84.4%
warning_issued 84.4%
outcome 84.4%
vehicle_color 53.8%
vehicle_make 54.8%
vehicle_model 52.7%
vehicle_registration_state 38.7%
vehicle_year 38.3%

Data notes:

  • The data separates warnings and citations. They are very different with respect to which fields they have available. Both contain duplicates. This happens when individuals receive more than one warning or citation within the same stop. We remove these by grouping the remaining fields by the stop key and date.
  • In some cases, there are multiple time stamps per unique (key, date) combination. In most of these cases, the timestamps differ by a few minutes, but all other fields (except for violation) are the same. In 0.1% of stops, the max span between timestamps is more than 60 minutes. In those cases it looks like the same officer stopped the same individual more than once in the same day.
  • Only citations have Ethnicity, which only provides information on whether the driver is Hispanic. We therefore exclude Iowa from our main analysis because race data is lacking.
  • Only (some) citations have county, the warnings only have trooper district. The mapping for the districts is provided in the resources folder. Counties were mapped by comparing the identifiers in the LOCKCOUNTY field with the cities in the LOCKCITY field.
  • The codes in the county field represent counties ordered alphabetically.
  • Additional columns in the raw data that may be of interest: EQUIPVIOL (free field -- usually a two-digit code, but some text descriptions of the violation), SCHEDULEDFINE, SURCHARGE, TOTALCOST, along with a bunch of columns that are >=99.9% NA, many of which have prefix "DISP" and pertain to data that would happen post-arrest.
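
The timestamp check described above, which finds (key, date) groups whose timestamps span more than 60 minutes, can be sketched as follows. The field names `key` and `timestamp` are hypothetical, and `timestamp` is assumed to be a `datetime`.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def long_span_stops(rows, threshold=timedelta(minutes=60)):
    """Return {(key, date): span} for groups whose timestamps exceed `threshold`."""
    times = defaultdict(list)
    for row in rows:
        times[(row["key"], row["timestamp"].date())].append(row["timestamp"])
    spans = {group: max(ts) - min(ts) for group, ts in times.items()}
    return {group: span for group, span in spans.items() if span > threshold}
```

Groups returned here are the candidates for being two separate stops of the same individual on the same day rather than one stop with several violations.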

Idaho Falls, ID

2008-08-13 to 2016-07-25

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 84.3%
lng 84.3%
neighborhood 93.9%
division 59.8%
subdivision 100.0%
zone 92.3%
officer_id 100.0%
officer_id_hash 100.0%
type 100.0%
disposition 31.8%

Data notes:

  • Race and gender are not on the ID driver's license and filled in only rarely, subject age is also 100% null
  • There is 'reptspec' data, but the values are extremely vague, e.g. "PAST", "SATURATION", "PERSON", "OTHER AGENCY"
  • There are 6 more months of data unprocessed with the main files since they are of a completely different format, but are available upon request
  • The data is missing demographic information as well as search/contraband information
  • It's unclear whether there are duplicates, since officerid is 0 sometimes and there is no demographic information

Statewide, IL

2012-01-01 to 2017-12-31

feature coverage rate
date 100.0%
time 99.9%
location 99.0%
beat 98.2%
subject_age 99.9%
subject_yob 100.0%
subject_race 100.0%
subject_sex 100.0%
department_id 100.0%
department_name 100.0%
type 100.0%
violation 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 100.0%
contraband_found 99.2%
contraband_drugs 100.0%
contraband_weapons 100.0%
search_conducted 99.9%
search_person 99.9%
search_vehicle 99.9%
search_basis 32.7%
reason_for_stop 100.0%
vehicle_make 97.4%
vehicle_year 99.9%
raw_DriverRace 100.0%
raw_ReasonForStop 100.0%
raw_TypeOfMovingViolation 100.0%
raw_ResultOfStop 100.0%

Data notes:

  • The data is very messy. The presence and meaning of fields relating to search and contraband vary year by year. Caution should be used when inspecting search and hit rates over time. We exclude Illinois from our time trend marijuana analysis for this reason.
  • We only process statewide data from 2012 to 2017. We received data back to 2004, but chose not to process it due to format issues and relevance.
  • For state patrol stops, we used police districts (see the beat column), which have a one-to-many relationship with counties; that is, a single district covers multiple counties. See the relevant map here. There is one district (#15) with a lot of stops that does not directly map to counties, as it refers to stops made on the Chicago tollways. Note that while we use districts in our analysis, zipcode can be extracted from the data and mapped to county, if desired.
  • Counties for local stops could be mapped by running the police departments in the AgencyName field through Google's geocoder.
  • For both state patrol and local stops, zipcode can be extracted and used to map to county, if needed.
  • The search_type_raw field is occasionally "Consent search denied", when a search was conducted. This occurs because the search request might be denied but a search was conducted anyway. Many searches have missing search type data, so we do not rely on search_basis when analyzing Illinois searches.
  • Race (raw_DriverRace) mapping:
    • 1 = White
    • 2 = Black
    • 3 = American Indian or Alaska Native
    • 4 = Hispanic
    • 5 = Asian
    • 6 = Native Hawaiian or Other Pacific Islander
  • Outcome (raw_ResultOfStop) mapping:
    • 1 = Citation
    • 2 = Written Warning
    • 3 = Verbal Warning (stop card)
  • We also pull through raw columns raw_ReasonForStop and raw_TypeOfMovingViolation to populate the reason_for_stop and violation columns in the clean data. We received dictionaries to help do so.
  • Note that IL contains state patrol and municipal police departments, but we use only the state patrol data in our analysis. There are occasional issues with some of the municipal P.D. data to watch out for: for example, the search and contraband data is fairly detailed and robust, except for Chicago Police, which has lots of NAs for search info (in 2012-2013) and lots of NAs for contraband info (in 2014). We do not alter these NA values, but recommend looking more closely into the Chicago city data (see below) rather than using the data given to us through the state records request.
  • Additional columns in the raw data that may be of interest: IL has really detailed search/contraband information. There are about 40 raw columns with search/contraband info; they fall into four categories Vehicle*, Driver*, Passenger*, and PoliceDog*, where * delineates things like what type of contraband was found or how much contraband was found, whether consent was requested, whether consent was given, who performed the search, etc.

Chicago, IL

2012-01-01 to 2016-12-31

feature coverage rate
date 100.0%
time 100.0%
location 91.6%
lat 75.7%
lng 75.7%
subject_age 25.1%
subject_race 26.5%
subject_sex 100.0%
officer_id 18.0%
officer_id_hash 18.0%
officer_age 7.1%
officer_race 93.0%
officer_sex 93.0%
officer_first_name 93.0%
officer_last_name 93.0%
officer_years_of_service 92.7%
type 100.0%
violation 100.0%
arrest_made 25.1%
citation_issued 75.6%
outcome 100.0%
raw_race 25.1%
raw_driver_race 1.4%

Data notes:

  • Dataset is created by joining arrests and citations on date, hour, officer name, and location
  • There may be duplicates, but there is often insufficient information to deduplicate, i.e. the time resolution is hourly and driver_race is null 99% of the time
  • Data includes citations and arrests, but is missing warnings
  • violation represents statute_description in the raw data
  • subject_race is based on raw columns race and driver_race, which are passed through
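
The join of arrests and citations described above can be sketched as a full outer join on the four key fields. This readme does not state which join type was used or how unmatched rows were handled, so the full join, the `source` label, and the lower-case key names are all assumptions for illustration.

```python
from collections import defaultdict

JOIN_KEY = ("date", "hour", "officer_name", "location")


def full_join(arrests, citations, key=JOIN_KEY):
    """Full outer join of two lists of dicts on `key`, labelling each row's source."""
    def key_of(row):
        return tuple(row.get(col) for col in key)

    citations_by_key = defaultdict(list)
    for citation in citations:
        citations_by_key[key_of(citation)].append(citation)

    joined, matched = [], set()
    for arrest in arrests:
        matches = citations_by_key.get(key_of(arrest))
        if matches:
            matched.add(key_of(arrest))
            joined.extend({**c, **arrest, "source": "both"} for c in matches)
        else:
            joined.append({**arrest, "source": "arrest"})
    for key_value, rows in citations_by_key.items():
        if key_value not in matched:
            joined.extend({**c, "source": "citation"} for c in rows)
    return joined
```

Because the key is only hourly, two unrelated events by the same officer at the same location within the same hour would be merged by a join like this, which is one reason deduplication is hard for this dataset.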

Fort Wayne, IN

2007-09-01 to 2017-09-30

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 97.0%
lng 97.0%
officer_first_name 99.8%
officer_last_name 99.9%
type 100.0%
disposition 99.4%
arrest_made 99.4%
citation_issued 99.4%
warning_issued 99.4%
outcome 66.3%

Data notes:

  • Roster.csv (police officer info) is available in raw data, but doesn't join cleanly to stops data; first names are often truncated and nicknames are used, i.e. Manny vs Manuel; it can be loaded and reviewed upon request.
  • Data is missing search/contraband information, as well as demographic information
  • disposition represents Description in the raw data; outcomes are derived from this column

Wichita, KS

2006-01-01 to 2016-12-31

feature coverage rate
date 100.0%
time 100.0%
location 97.3%
lat 97.1%
lng 97.1%
subject_age 81.1%
subject_race 97.7%
subject_sex 82.2%
officer_first_name 99.5%
officer_last_name 99.5%
type 100.0%
disposition 97.5%
violation 100.0%
citation_issued 100.0%
outcome 100.0%
posted_speed 30.6%
vehicle_color 93.8%
vehicle_make 95.2%
vehicle_model 46.5%
vehicle_year 27.7%
raw_defendant_race 97.7%
raw_defendant_ethnicity 69.0%

Data notes:

  • Data is deduplicated on raw columns citation_date_time, citation_location, defendant_first_name, defendant_last_name, defendant_age, defendant_sex, and defendant_race, resulting in ~4.1% fewer records
  • Data is missing search/contraband fields
  • citation_number in the raw data doesn't appear to be unique, i.e. citation "07M000645" is associated with two different dates, locations, and people
  • Only citations are included
  • violation represents charge_description in the raw data
  • disposition represents charge_disposition in the raw data
  • subject_race is based on the raw columns defendant_ethnicity and defendant_race, which are passed through

Louisville, KY

2015-01-01 to 2018-01-28

feature coverage rate
date 100.0%
time 100.0%
location 99.9%
lat 99.6%
lng 99.6%
beat 96.3%
division 96.2%
subject_age 73.2%
subject_race 100.0%
subject_sex 100.0%
officer_race 99.9%
officer_sex 99.9%
type 100.0%
violation 73.3%
citation_issued 100.0%
warning_issued 100.0%
outcome 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_basis 100.0%
reason_for_search 99.8%
raw_activity_division 96.2%
raw_division 69.4%
raw_activity_beat 96.1%
raw_beat 69.6%
raw_driver_race 100.0%
raw_persons_race 73.2%
raw_persons_ethnicity 71.2%
raw_driver_age_range 100.0%
raw_was_vehcile_searched 100.0%
raw_citation_location 73.3%

Data notes:

  • While we have raw csvs for all citations, we keep only those records that join onto the stops data; the source of this data is here: https://data.louisvilleky.gov/dataset/uniform-citation-data
  • Data is deduplicated on raw columns officer_gender, officer_race, officer_age_range, activity_date, activity_time, activity_location, activity_division, division, activity_beat, beat, driver_gender, persons_sex, driver_race, persons_race, persons_ethnicity, driver_age_range, person_age, persons_home_city, persons_home_state, person_home_zip, reducing the number of rows by ~%
  • subject_race is based on the raw column driver_race, since it is null 0.03% of the time compared to 18.6% for persons_race and 18.60% for persons_ethnicity; all are passed through with raw_ prefix
  • violation represents raw column charge_desc
  • All stops are not null for at least one of the driver_* columns or number_of_passengers or was_vehicle_searched columns, implying all stops are vehicular
  • location used for geocoding is activity_location, which had a lower null rate than citation_location, but the latter was passed through as raw_citation_location
  • subject_age is based on persons_age from the citation data, although it is null more often than driver_age_range; the latter, however, only gives a range, so couldn't be used for this column; it is passed through as raw_driver_age_range
  • search_conducted is based on was_vehcile_searched, which is passed through as raw_was_vehcile_searched (sic); there were 3 NAs that were coerced to false under the assumption that the officers may simply not have recorded the absence of a search
  • search_basis was based on reason_for_search; k9 searches matched the pattern "K9|K-9|DOG", plain view searches matched anything mentioning plain view/smell or anything that could be seen in plain sight and matched the following pattern "BAGGIES|DRUGS|GUN|MARIJUANA|ODOR|PILLS|PIPE|PLAIN VIEW|SMELL", consent matched "CONSENT|CONSE", probable cause matched "PROB|P/C|PC|P.C.", and everything else was classified as "other"; this was verified to be accurate for 99.8% of entries; the long tail was not checked, but anyone viewing the data can see the original values in the reason_for_search column
  • data is lacking explicit contraband information, but some of this can be inferred from reason_for_search
  • frisk_performed is true when reason_for_search matches the pattern "TERRY|PAT"; it is false otherwise (NA and no match)
  • 2018 has data only from January
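
The pattern matching described above for search_basis and frisk_performed can be sketched with the quoted patterns. The order in which the patterns are checked, the upper-casing of the input, and returning NA for a missing reason are assumptions; the patterns themselves are copied from the notes, including the unescaped "." in "P.C.", which a regex treats as a wildcard.

```python
import re

# Patterns quoted from the notes above, checked in the order they are listed.
BASIS_PATTERNS = [
    ("k9", re.compile(r"K9|K-9|DOG")),
    ("plain view", re.compile(
        r"BAGGIES|DRUGS|GUN|MARIJUANA|ODOR|PILLS|PIPE|PLAIN VIEW|SMELL")),
    ("consent", re.compile(r"CONSENT|CONSE")),
    ("probable cause", re.compile(r"PROB|P/C|PC|P.C.")),
]
FRISK_PATTERN = re.compile(r"TERRY|PAT")


def search_basis(reason_for_search):
    """Classify a free-text reason; unmatched text is "other", NA stays NA."""
    if reason_for_search is None:
        return None
    text = reason_for_search.upper()
    for label, pattern in BASIS_PATTERNS:
        if pattern.search(text):
            return label
    return "other"


def frisk_performed(reason_for_search):
    """True on a TERRY|PAT match, False otherwise (including NA)."""
    if reason_for_search is None:
        return False
    return bool(FRISK_PATTERN.search(reason_for_search.upper()))
```

Since the first matching pattern wins, a reason mentioning both a dog and consent would be classified as k9 under this ordering.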

Owensboro, KY

2015-09-01 to 2017-09-01

feature coverage rate
date 100.0%
time 100.0%
location 99.9%
lat 100.0%
lng 99.9%
sector 99.9%
subject_age 100.0%
subject_dob 99.9%
subject_race 99.7%
subject_sex 100.0%
officer_id 100.0%
officer_id_hash 100.0%
type 99.4%
violation 99.9%
arrest_made 100.0%
citation_issued 100.0%
outcome 100.0%
vehicle_registration_state 99.2%
raw_race 99.7%

Data notes:

  • There is a list_of_officers.csv as well as the excel spreadsheet (preferable given the formatting) that have more officer information available upon request
  • Data is missing search/contraband information
  • Data is all citations, although it appears to include an arrest indicator as well, when that also occurred
  • Provided longitude is lacking the negative sign, which we add (without which all points are in central China)
  • subject race is based on RACE in the raw data and passed through as raw_race; data does not include Hispanic.
  • violation is a concatenation of Violation Description X where X is 1 to 9
  • type is based on Violation Description 1
  • 2015 and 2017 only have data for part of the year
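
Two of the fixes above, restoring the longitude sign and concatenating the violation columns, can be sketched as follows. The `|` separator and the function names are assumptions for illustration.

```python
def fix_longitude(lng):
    """Restore the missing negative sign; this location is in the western hemisphere."""
    if lng is None:
        return None
    return -abs(float(lng))


def combine_violations(row, sep="|"):
    """Concatenate Violation Description 1 through 9, skipping empty values."""
    parts = [row.get(f"Violation Description {i}") for i in range(1, 10)]
    return sep.join(p.strip() for p in parts if p and p.strip())
```

Using `-abs(...)` rather than a plain negation keeps the fix safe to apply to values that already carry the correct sign.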

New Orleans, LA

2010-01-01 to 2018-07-18

feature coverage rate
date 100.0%
time 100.0%
location 81.3%
lat 50.8%
lng 50.8%
district 100.0%
zone 100.0%
subject_age 97.5%
subject_race 97.7%
subject_sex 97.7%
officer_assignment 100.0%
type 70.7%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 65.5%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
search_basis 100.0%
reason_for_stop 100.0%
vehicle_color 53.3%
vehicle_make 54.0%
vehicle_model 50.6%
vehicle_year 53.1%
raw_actions_taken 76.1%
raw_subject_race 97.7%

Data notes:

  • Data is deduplicated on EventDate, BlockAddress, and SubjectID, which reduces the number of rows by ~0.07%
  • Addresses were partially anonymized by the department replacing the last two numbers of the address number with XX; these were replaced with 00 so we could at least geocode the block level address
  • search_conducted is true when the ActionsTaken includes "Search Occurred: Yes", and it's false when that is not present or the ActionsTaken column is NA, under the assumption that NA is equivalent to "Stop Results: No action taken"
  • reason_for_stop is StopDescription in the raw data; type is based on this column
  • outcomes, search, and contraband fields are all based on the ActionsTaken column, which is passed through as raw_actions_taken; NA in this column is assumed to be 'no actions taken'
  • subject_race is based on SubjectRace raw column, which is passed through as raw_subject_race
  • data before 2010 is sparse and unreliable so it is removed from the clean dataset
  • 2018 only has partial data
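
The address and search rules above can be sketched as below. The regular expression assumes the anonymized "XX" sits at the end of the leading street number, and the function names are hypothetical.

```python
import re


def block_address(address):
    """Replace the anonymized "XX" in the street number with "00" for geocoding."""
    if address is None:
        return None
    return re.sub(r"^(\d*)XX\b", r"\g<1>00", address.strip())


def search_conducted(actions_taken):
    """True only when ActionsTaken reports a search; NA is treated as no action."""
    return actions_taken is not None and "Search Occurred: Yes" in actions_taken
```

For example, `block_address("12XX Canal St")` gives `"1200 Canal St"`, which geocodes to the block rather than the exact address.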

Statewide, MA

2007-01-01 to 2015-12-31

feature coverage rate
date 100.0%
location 99.8%
county_name 99.8%
subject_age 95.4%
subject_race 100.0%
subject_sex 99.5%
type 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 99.8%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
contraband_alcohol 100.0%
contraband_other 1.6%
frisk_performed 1.6%
search_conducted 100.0%
search_basis 91.6%
reason_for_stop 51.4%
vehicle_type 99.9%
vehicle_registration_state 99.7%
raw_Race 100.0%

Data notes:

  • The search and outcome fields are inconsistent. We take the most progressive interpretation: if one of SearchYN, SearchDescr or the outcome columns indicates that there was a search, we label them as such.
  • While we define a contraband_found column in case it is useful to other researchers, it is sufficiently messy (there are multiple ways you might define contraband_found, and they are quite inconsistent) that we exclude it from our contraband analysis.
  • In <1% of the data, RsltSrchNo and RsltSrch<contraband type> conflict. In these cases, we use the value from RsltSrchNo.
  • Violation data is not very granular.
  • Counties were mapped by running the cities in the CITY_TOWN_NAME field through Google's geocoder.
  • There are only a handful of stops in the data before 2007; we drop those years as they are clearly unreliable. It appears that the first few months (nearly half) of 2007 are also incomplete, but we have not attempted to remove the incomplete months.
  • subject_race was mapped from raw_Race
  • Additional columns in the raw data that may be of interest: SpecialEvent (GHSB Speed Detail, Road Block, Blue Blitz; 99% NA), PlateReader (boolean), OwnTruckPass (O, W, T, P)
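
The "any indicator" rule for searches described in the first note can be sketched as below. This is a minimal illustration in Python, not the project's processing code; the function name, the "Y" encoding for SearchYN, and the idea that any non-empty SearchDescr counts as a search are all assumptions.

```python
from typing import Optional


def search_conducted(search_yn: Optional[str],
                     search_descr: Optional[str],
                     outcome_indicates_search: bool) -> bool:
    """Return True if any of the three indicators suggests a search.

    Assumed encodings (hypothetical): SearchYN is "Y"/"N"/None, and
    SearchDescr is a free-text description or None.
    """
    yn_flag = (search_yn or "").strip().upper() == "Y"
    descr_flag = bool((search_descr or "").strip())
    return yn_flag or descr_flag or outcome_indicates_search
```

Under these assumptions, a row with SearchYN of "N" but a non-empty SearchDescr is still labeled as a search.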

Baltimore, MD

2011-01-01 to 2017-12-30

feature coverage rate
date 100.0%
time 98.7%
beat 63.1%
district 60.1%
officer_id 89.9%
officer_id_hash 89.9%
type 97.2%
citation_issued 100.0%
outcome 100.0%

Data notes:

  • Data is missing search/contraband information as well as demographic information and outcomes other than citations
  • The primary key seems to be a combination of Ticket and Citation Number; when Ticket is null, Citation Number isn't and vice versa; both are duplicated across rows, so we deduplicate on those two IDs coalesced, resulting in ~0.01% fewer records
  • Data lacks translations for Ordinance Code and Citation Type
  • Violation data is almost all null
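
The deduplication on the coalesced Ticket and Citation Number IDs could look roughly like the sketch below. It is written in Python for illustration and is not the project's processing code; the row representation (a list of dicts), the function name, and the choice to keep rows that have neither ID are assumptions.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def dedupe_on_coalesced_id(rows: List[Row]) -> List[Row]:
    """Keep the first row for each coalesced (Ticket, Citation Number) ID."""
    seen = set()
    kept = []
    for row in rows:
        # Coalesce: use Ticket when present, otherwise Citation Number.
        key = row.get("Ticket") or row.get("Citation Number")
        if not key:
            kept.append(row)  # assumption: rows with no ID at all are kept
            continue
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```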

Statewide, MD

2007-01-01 to 2014-03-31

feature coverage rate
date 97.8%
time 23.0%
location 23.0%
subject_age 22.9%
subject_dob 22.9%
subject_race 99.6%
subject_sex 98.8%
department_name 100.0%
type 100.0%
disposition 1.9%
violation 0.1%
arrest_made 94.0%
citation_issued 93.2%
warning_issued 93.2%
outcome 78.7%
contraband_found 82.0%
contraband_drugs 99.6%
contraband_weapons 98.5%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
search_basis 14.8%
reason_for_arrest 87.9%
reason_for_search 98.5%
reason_for_stop 99.2%
raw_Race 99.6%
raw_Outcome 93.2%
raw_Arrest_Made 28.8%

Data notes:

  • The data is very messy. It comes from three different time periods: 2007, 2009-2012, 2013-2014. They all have different columns and slightly different conventions for how things are recorded. We attempted to standardize the fields as much as possible.
  • Time resolution of the data varies by year. Prior to 2013, data is reported annually. From 2013 onward, data is reported daily. So stop dates prior to 2013 are not precise to the nearest day and are just reported as Jan 1.
  • Counties could theoretically be mapped by running the police departments in the Agency field through Google's geocoder, but this does not work for state patrol stops, for which we have no county information. Maryland's data is not good enough for us to include in our analysis, so we chose not to do this.
  • subject_race is mapped from raw_Race.
  • outcome and arrest_made are mapped from raw_Outcome and raw_Arrest_Made; see processing script for details.
  • search_basis is a cleaned up version of reason_for_search which is a free field populated by raw column Search Reason.
  • Prior to 2013, there are quite a few NAs for contraband; we do not cast these to false because it seems to be too many to assume they're all false -- it feels more believable that there is actual missing data in these annually reported, messy datasets.
  • Additional columns from the raw data that may be of interest: Duration of Search

Statewide, MI

2001-07-06 to 2016-05-09

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
county_name 100.0%
subject_race 97.8%
officer_id 100.0%
officer_id_hash 100.0%
department_id 100.0%
department_name 100.0%
type 100.0%
violation 99.9%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 100.0%
reason_for_stop 100.0%
speed 30.5%
posted_speed 30.5%
charged_speed 91.7%
raw_Race 97.8%

Data notes:

  • The original data had some unquoted fields (VoidReason and Description) which had commas in them. We manually fixed these with a python script, which can be found in the /scripts folder.
  • Driver race data has more than 50% missing data, so we excluded Michigan from the analysis in the paper.
  • The codes in the CountyCode field represent counties ordered alphabetically.
  • Rows represent violations, not stops, so we remove duplicates by grouping by the other fields.
  • Michigan data has many additional columns. One cluster we find very interesting is SpeedPosted (pulled through as posted_speed), SpeedDetected (pulled through as speed), and SpeedCharged (pulled through as charged_speed). Most places with speeding information give just speed and posted speed; analyses like the bunching analysis try to infer the true speed, and whether drivers of different races were discounted at different rates. Michigan's transparency about discounting (from detected speed to charged speed) could make this process much easier to analyze. However, we do not do so because race information is insufficient.
  • Since all rows have a TicketNum, we assume that if any ticket is not a warning, then it is a citation. But then potentially for outcome, anything that is not an arrest or warning could have a court summons. It's possible raw data columns Felony, Misdemeanor, CivilInfraction could help disambiguate.
  • Additional raw data columns that may be of interest: Michigan has over 160 columns in the raw data, though many of them are >99.9% NA. There are ID columns for everything from violation codes, citation codes, infraction codes, incident numbers, court code, etc. Other columns: VehicleImpounded, Injury, Felony, Misdemeanor, CivilInfraction.
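
Collapsing violation-level rows into one row per stop, as described in the note on removing duplicates, might be sketched as follows. This is an illustrative Python sketch rather than the project's code; the function name, the "violation" column name, and the "|" separator used to join violations are assumptions.

```python
from typing import Dict, List, Tuple


def collapse_violations(rows: List[Dict[str, str]],
                        violation_col: str = "violation",
                        sep: str = "|") -> List[Dict[str, str]]:
    """Group rows on every column except violation_col and join the violations."""
    groups: Dict[Tuple, List[str]] = {}
    for row in rows:
        # The grouping key is every (column, value) pair other than the violation.
        key = tuple(sorted((k, v) for k, v in row.items() if k != violation_col))
        groups.setdefault(key, []).append(row[violation_col])
    stops = []
    for key, violations in groups.items():
        stop = dict(key)
        stop[violation_col] = sep.join(violations)
        stops.append(stop)
    return stops
```

Two rows that agree on every field except the violation become a single stop whose violation value lists both.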

Saint Paul, MN

2001-01-01 to 2016-12-13

feature coverage rate
date 100.0%
time 100.0%
lat 100.0%
lng 100.0%
police_grid_number 100.0%
subject_age 13.2%
subject_race 82.5%
subject_sex 84.2%
type 100.0%
citation_issued 100.0%
outcome 13.9%
frisk_performed 100.0%
search_conducted 100.0%
search_vehicle 100.0%
raw_race_of_driver 100.0%

Data notes:

  • Data is deduplicated on DATE OF STOP, RACE OF DRIVER, AGE OF DRIVER, GENDER OF DRIVER, and POLICE GRID NUMBER, resulting in ~0.02% fewer records
  • Data is lacking contraband and location information
  • If a citation was not issued, it's unclear whether a warning was issued or something else
  • subject_race is based on RACE OF DRIVER in the raw data, which is passed through as raw_race_of_driver
  • search_conducted is based on VEHICLE SEARCHED?; "No Data" is assumed to be false because it is likely that "No Data" is an autofill value for NA, which we coerce to false elsewhere under the assumption that officers sometimes don't record the absence of a search; the same is done for frisk_performed
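
The coercion of "No Data" to false described in the last note can be illustrated with the short Python sketch below. The function name and the "Yes" encoding for a recorded search are assumptions; only the treatment of "No Data" and missing values as false comes from the note above.

```python
from typing import Optional


def coerce_search_flag(raw_value: Optional[str]) -> bool:
    """Map a raw VEHICLE SEARCHED? value to a boolean.

    "No Data" and missing values are treated as False, on the assumption
    that officers sometimes do not record the absence of a search.
    The "Yes" encoding for a recorded search is hypothetical.
    """
    if raw_value is None:
        return False
    return raw_value.strip().lower() == "yes"
```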

Statewide, MO

2010-01-01 to 2015-01-01

feature coverage rate
date 100.0%
location 100.0%
subject_race 100.0%
department_name 100.0%
type 100.0%
contraband_found 100.0%
search_conducted 100.0%
raw_race 100.0%

Data notes:

  • The original data was aggregated. There is detail on a number of fields (age, stop purpose, outcome) that is not usable as it is not cross-tabulated with the other fields.
  • Because this is aggregate data, stop date is only precise to the nearest year, and is recorded as Jan 1 for all stops.
  • Note that the location column comes from the department's work location, which is coarse; and highway patrol stops thus all get mapped to Jefferson City.

Statewide, MS

2013-01-01 to 2016-07-27

feature coverage rate
date 100.0%
county_name 99.4%
subject_age 100.0%
subject_dob 99.9%
subject_race 99.9%
subject_sex 99.9%
department_id 100.0%
department_name 99.2%
type 100.0%
violation 100.0%
speed 34.9%
posted_speed 34.9%
raw_race 99.9%

Data notes:

  • Counties were mapped using the dictionary provided, which is added to the raw data folder. Counties are numbered alphabetically.
  • There is no data on Hispanic drivers, so we exclude Mississippi from our main analysis.
  • subject_race was mapped from raw_race.
  • violation was populated with raw column acd.
  • Additional columns in the raw data that may be of interest: acdoos and aamva have alpha-numeric codes like acd (i.e., violation in the clean data), acdsev (0-3; NA for 65%), acc (boolean), court (mostly MUN or JUS), elect (E or NA), disp (G, P, D, S, N), fine.

Statewide, MT

2009-01-01 to 2016-12-31

feature coverage rate
date 100.0%
time 100.0%
location 99.6%
lat 100.0%
lng 100.0%
county_name 100.0%
subject_age 99.6%
subject_race 100.0%
subject_sex 100.0%
department_name 100.0%
type 100.0%
violation 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_basis 96.6%
reason_for_stop 100.0%
vehicle_make 99.5%
vehicle_model 97.1%
vehicle_type 92.1%
vehicle_registration_state 96.2%
vehicle_year 99.1%
raw_Race 100.0%
raw_Ethnicity 100.0%
raw_SearchType 100.0%

Data notes:

  • subject_race was mapped from raw_Ethnicity and raw_Race.
  • search_conducted and search_basis were mapped from raw_SearchType.
  • violation is a concatenation of Violation[1-3] from the raw data.
  • stop_outcome is derived from raw columns EnforcementAction[1-3]; see processing script for details.
  • reason_for_stop is populated from raw column ReasonForStop.
  • Additional columns in the raw data that may be of interest: VehicleIsCommercial, VehicleIsMotorcycle, ViolationDescription (which gives a bit more detail than the violation columns we pull through into the clean data), ViolationUnlawfulSpeed (boolean), AggressiveDriving (boolean), FaultyOtherDescription (free field description of equipment violations), WarningOtherViolations[1,2] (free field description of warning), WarningsThisRecord (0-3, indicating how many warnings were given), CitationsThisRecord (0-3 indicating how many citations were given), EnforcementAction[1-3] (gives slightly more detail than stop_outcome, e.g., misdemeanor arrest vs felony arrest).

Raleigh, NC

2002-01-01 to 2015-12-31

feature coverage rate
date 100.0%
time 99.9%
location 100.0%
county_name 100.0%
subject_age 100.0%
subject_race 100.0%
subject_sex 100.0%
officer_id 100.0%
officer_id_hash 100.0%
department_name 100.0%
type 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 97.8%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
search_basis 98.4%
reason_for_frisk 0.1%
reason_for_search 100.0%
reason_for_stop 100.0%
raw_Ethnicity 100.0%
raw_Race 100.0%
raw_action_description 100.0%

Data notes:

  • Data is pulled out of Statewide, NC data, so refer to that for processing documentation
  • Missing data 2/2004, 2/2005, 5/2005, 10/2005, 11/2005, 3/2006, 8/2006, 4/2007, 11/2008, 1/2009, 11/2012, 9/2013, 11/2013, 7/2014, 10/2014, 10/2015

Statewide, NC

2000-01-01 to 2015-12-31

feature coverage rate
date 100.0%
time 49.7%
location 99.7%
county_name 97.9%
subject_age 100.0%
subject_race 100.0%
subject_sex 100.0%
officer_id 100.0%
officer_id_hash 100.0%
department_name 100.0%
type 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 97.0%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
search_basis 96.4%
reason_for_frisk 0.1%
reason_for_search 100.0%
reason_for_stop 100.0%
raw_Ethnicity 100.0%
raw_Race 100.0%
raw_action_description 100.0%

Data notes:

  • Stop time is often unreliable — we have a large overdensity of 00:00 values, which we set to NA.
  • Attempting to deduplicate on StopDate, OfficerId, StopLocation, StopCity, PersonID, Age, Gender, Ethnicity, and Race reduced rows by 0%, i.e. there do not appear to be duplicates
  • The location of the stop is recorded in two different ways. Some stops have a county code, which can be mapped using the provided dictionary, which is included in the raw data. Other stops are only labeled with the state patrol district. Some districts map directly onto counties, in which case we label the stop with that county. However, some districts cover multiple counties. Stops in these districts can thus not be unambiguously mapped to a single county. In both cases, district of the stop is provided in the "district" column, providing coarse location data for the vast majority of stops.
  • Action is sometimes "No Action" or a similarly minor enforcement action even when DriverArrest or PassengerArrest is TRUE. In these cases, we set outcome to be "Arrest" because the outcome field represents the most severe outcome of the stop.
  • There can be multiple search bases per stop-search-person, so we collapse them into a single value
  • There is a 1:N correspondence between StopID and PersonID, so we filtered out passengers when joining demographic information to stop data to prevent duplicates; this also means that the demographic information pertains to the driver
  • When joining search data onto the stop data, the data is joined by StopID only and not also PersonID, since the person searched could be either the driver or passenger; this means that the search data may be of either the driver or the passenger, and in 3.6% of cases, it was actually the passenger who was searched, but search_conducted is true in either case; fortunately, there is a 1:1 correspondence between StopID and SearchID, as well as between SearchID and PersonID (who, again, can be either the driver or passenger) and SearchID and ContrabandID
  • subject_race is based on Ethnicity and Race, which are passed through as raw_*
  • outcomes are based on raw_action_description, which is based on the raw column Action and translated given the provided codes
  • frisk and search data is based on SearchID and search_type_description, which is passed through with raw_*; the latter is based on the raw column SearchType and translated using the given data dictionary
  • stop_purpose_description is based on raw column StopPurpose and is translated using the given data dictionary and passed through as reason_for_stop
  • reason_for_search represents the raw column Basis
  • Additional columns in the raw data that may be of interest: Ounces, Pounds, Kilos, Grams, Dosages, Weapons provide greater resolution on contraband; Gallons, Pints, Money, DollarAmt may also do so; EncounterForce (boolean), EngageForce (boolean); [Officer,Driver,Passenger]Injury (all booleans)
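
The "most severe outcome" rule from the note on Action and the arrest flags can be sketched as below. This is a hedged Python illustration, not the project's code; the function name and the severity ordering (warning < citation < arrest) are assumptions made for the example.

```python
from typing import Optional

# Severity order assumed for illustration: warning < citation < arrest.
SEVERITY = {"warning": 1, "citation": 2, "arrest": 3}


def most_severe_outcome(action_outcome: Optional[str],
                        driver_arrest: bool,
                        passenger_arrest: bool) -> Optional[str]:
    """Return the most severe outcome implied by Action and the arrest flags."""
    candidates = []
    if action_outcome in SEVERITY:
        candidates.append(action_outcome)
    # An arrest flag overrides a milder Action such as "No Action".
    if driver_arrest or passenger_arrest:
        candidates.append("arrest")
    if not candidates:
        return None
    return max(candidates, key=SEVERITY.__getitem__)
```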

Winston-Salem, NC

2000-01-11 to 2015-12-31

feature coverage rate
date 100.0%
time 78.6%
location 100.0%
county_name 100.0%
subject_age 100.0%
subject_race 100.0%
subject_sex 100.0%
officer_id 100.0%
officer_id_hash 100.0%
department_name 100.0%
type 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 97.9%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
search_basis 99.0%
reason_for_frisk 0.0%
reason_for_search 100.0%
reason_for_stop 100.0%
raw_Ethnicity 100.0%
raw_Race 100.0%
raw_action_description 100.0%

Data notes:

  • Data is pulled out of Statewide, NC data, so refer to that for processing documentation
  • Missing data 8/2014, 1/2015, 2/2015, and 5/2015

Greensboro, NC

2000-01-04 to 2015-12-31

feature coverage rate
date 100.0%
time 99.1%
location 100.0%
county_name 99.8%
subject_age 100.0%
subject_race 100.0%
subject_sex 100.0%
officer_id 100.0%
officer_id_hash 100.0%
department_name 100.0%
type 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 97.2%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
search_basis 97.7%
reason_for_frisk 0.1%
reason_for_search 100.0%
reason_for_stop 100.0%
raw_Ethnicity 100.0%
raw_Race 100.0%
raw_action_description 100.0%

Data notes:

  • Data is pulled out of Statewide, NC data, so refer to that for processing documentation
  • Missing data 8/2015, 11/2015, 11/2016, and 3/2014

Durham, NC

2001-12-28 to 2015-12-31

feature coverage rate
date 100.0%
time 85.2%
location 100.0%
county_name 100.0%
subject_age 100.0%
subject_race 100.0%
subject_sex 100.0%
officer_id 100.0%
officer_id_hash 100.0%
department_name 100.0%
type 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 96.7%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
search_basis 96.4%
reason_for_frisk 0.2%
reason_for_search 100.0%
reason_for_stop 100.0%
raw_Ethnicity 100.0%
raw_Race 100.0%
raw_action_description 100.0%

Data notes:

  • Data is pulled out of Statewide, NC data, so refer to that for processing documentation
  • Missing data from 2008-2013:
    • 2008 missing January data
    • 2009 missing February, April, July, September, October, December
    • 2010 missing February, November
    • 2013 missing May

Fayetteville, NC

2000-01-07 to 2015-12-31

feature coverage rate
date 100.0%
time 96.8%
location 100.0%
county_name 100.0%
subject_age 100.0%
subject_race 100.0%
subject_sex 100.0%
officer_id 100.0%
officer_id_hash 100.0%
department_name 100.0%
type 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 97.5%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
search_basis 95.3%
reason_for_frisk 0.2%
reason_for_search 100.0%
reason_for_stop 100.0%
raw_Ethnicity 100.0%
raw_Race 100.0%
raw_action_description 100.0%

Data notes:

  • Data is pulled out of Statewide, NC data, so refer to that for processing documentation

Charlotte, NC

2000-01-01 to 2015-12-31

feature coverage rate
date 100.0%
time 99.9%
location 100.0%
county_name 99.9%
subject_age 100.0%
subject_race 100.0%
subject_sex 100.0%
officer_id 100.0%
officer_id_hash 100.0%
department_name 100.0%
type 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 95.5%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
search_basis 98.2%
reason_for_frisk 0.1%
reason_for_search 100.0%
reason_for_stop 100.0%
raw_Ethnicity 100.0%
raw_Race 100.0%
raw_action_description 100.0%

Data notes:

  • Data is pulled out of Statewide, NC data, so refer to that for processing documentation

Grand Forks, ND

2007-01-01 to 2016-12-31

feature coverage rate
date 100.0%
time 100.0%
location 96.8%
lat 93.5%
lng 93.5%
subject_race 99.0%
subject_sex 100.0%
type 63.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 100.0%
reason_for_stop 100.0%
raw_race 99.0%

Data notes:

  • Data is deduplicated on raw columns agency, date, time, sex, race, age, ht_ft, ht_in, house, and street, reducing the number of records by ~14.2%
  • Many of the offenses fall into categories other than obvious pedestrian or vehicular stops, e.g. BARKING DOG, and are encoded as NA for type, but the description is provided in reason_for_stop
  • The department says that arrest, search, and contraband are not recorded with stop data
  • There are unidentified spikes that are relatively large every year in late May or early June, i.e. 2010-05-08, 2011-06-02, 2012-05-05, 2013-05-04, 2014-05-10, 2015-05-09, 2016-05-20; it's unclear what these correspond to and the PD has not yet responded to our inquiry
  • subject_race is based on raw_race, which is passed through; the data does not appear to include Hispanic.

Statewide, ND

2010-01-01 to 2015-06-25

feature coverage rate
date 100.0%
time 100.0%
location 99.8%
county_name 100.0%
subject_age 99.9%
subject_race 100.0%
subject_sex 100.0%
type 99.2%
violation 100.0%
outcome 100.0%
raw_Race 100.0%

Data notes:

  • The data contain records only for citations, not warnings.
  • Rows represent individual citations, not stops, so we remove duplicates by grouping by the other fields.
  • The violation field is populated by citation codes and their descriptions.
  • subject_race is mapped from raw_Race.
  • Note that deduping by violation_date_time, Age, sex, Race, county_name, street_cnty_rd_location, desc_of_area, highway, ref_point reduces rows by ~16.6%.

Statewide, NE

2002-01-01 to 2016-10-01

feature coverage rate
date 100.0%
county_name 47.7%
subject_race 100.0%
department_name 100.0%
type 100.0%
search_conducted 100.0%
raw_dept_lvl 100.0%
raw_dept 100.0%
raw_Race 100.0%

Data notes:

  • The original data was aggregated. It was grouped by stop reason, outcome and whether there was a search separately. Therefore, it is not possible to cross tabulate them together. We only use the last grouping.
  • State and local stops are mixed together, identifiable by the raw_dept_lvl field. We map levels 1, 5, 9, 10, 11 to "Nebraska State Agency" in the department_name field; for the other levels, we fill department_name with raw_dept. Note that levels 1, 2, and 3 are state patrol, local P.D. and sheriff P.D.s, respectively; levels 5-12 are special agencies or sectors of some sort; there are no stops for level 4.
  • The data is by quarter, not by day. So all stop_dates are the first date of the quarter.
  • There is a strange jump (Q1) and then dip (Q2–4) in the data for 2012. This stems from all state patrol stops for 2012 being recorded as happening in the first quarter. Municipal departments appear to have reasonably well-dated data for 2012.
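
Setting each stop date to the first date of its quarter, as described above, amounts to the following small calculation. The Python function below is only a sketch; its name and the (year, quarter) input format are assumptions about how the raw period might be represented.

```python
from datetime import date


def quarter_start(year: int, quarter: int) -> date:
    """Return the first day of the given quarter (1-4) of the given year."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    # Quarters start in months 1, 4, 7, and 10.
    return date(year, 3 * (quarter - 1) + 1, 1)
```

For example, the fourth quarter of 2016 maps to 2016-10-01, which matches the last date in this dataset's range.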

Statewide, NH

2014-01-01 to 2015-12-31

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 87.7%
lng 87.7%
county_name 100.0%
subject_age 52.9%
subject_dob 52.9%
subject_race 63.9%
subject_sex 98.4%
type 100.0%
violation 94.1%
citation_issued 100.0%
warning_issued 100.0%
outcome 94.3%
raw_RACE_CDE 64.0%
raw_CITATION_RESPONSE_DSC 100.0%

Data notes:

  • The driver_race field was populated by hand-written codes that we manually decoded. They are prone to mislabeling and should be used with caution. Also, a very high percentage of stops (>30%) are missing race data entirely. We map the most common codes, covering more than 99% of stops with data, but we do not interpret the long tail of misspellings because many of them are ambiguous, we do not want to make assumptions, and it does not significantly improve the data. We exclude this dataset from our analysis because it has too much missing race data.
  • We determine stop outcome (citation, warning, etc) using raw_CITATION_RESPONSE_DSC, and we determine subject_race from raw_RACE_CDE.
  • The driver_age field was not populated for the 2014.2 dataset.
  • Rows represent violations, not stops, so we remove duplicates by grouping by the other fields.

Statewide, NJ

2009-01-01 to 2016-12-31

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
subject_race 3.9%
subject_sex 99.5%
officer_id 100.0%
officer_id_hash 100.0%
department_id 100.0%
type 100.0%
violation 77.0%
arrest_made 3.3%
citation_issued 77.8%
warning_issued 77.7%
outcome 78.0%
contraband_found 95.6%
frisk_performed 3.6%
search_conducted 3.7%
vehicle_color 97.3%
vehicle_make 96.3%
vehicle_model 24.9%
vehicle_registration_state 99.3%
raw_TOWNSHIP 100.0%
raw_RACE 100.0%
raw_Ethnicity 3.9%

Data notes:

  • New Jersey data may be updated: we still have a number of questions we are waiting on the state to answer.
  • New Jersey uses software produced by LawSoft Inc. There are two sets of data: CAD (computer aided dispatch, recorded at the time of stop) and RMS (record management system, recorded later). They have almost completely disjoint fields, and only RMS records have information on searches. We believe the data from the two systems should really be joined, but according to the NJSP there is not a programmatic way to do so. Therefore, we fully process the CAD data, which appears to be the dataset that corresponds to traffic stops. We did notice that the RMS file can be joined if a few of the fields are combined in a certain way. This method isn't perfect, and there are lots of nulls, but we include it in hopes that some data is better than no data.
  • Because of the above, we only know search/frisk/contraband information in about 13% of stops.
  • In the CAD data, there are often multiple rows per incident. Some of these are identical duplicates, which we remove. For the remaining records, we group by CAD_INCIDENT, because the NJSP told us that each CAD_INCIDENT ID refers to one stop. We verified that more than 99.9% of CAD_INCIDENT IDs had unique location and time, implying that they did, in fact, correspond to distinct events.
  • driver_race and driver_gender correspond to the race and gender of the driver, not the passenger.
  • Statutes are mapped using the traffic code, where possible.
  • The CAD records contain TOWNSHIP which could be mapped to a county by running the values through the Google geocoder.
  • Additional raw data columns that might be of interest (note, these are only in 13% of data since they come from the spotty, impossible matching described above): Sobriety Test, CCH Check, NCIC Check, Warrant Check, Warrant. Note that since we do not have a guarantee that these 13% of rows with data are a random or representative sample, we do not recommend drawing conclusions from this information.
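
The check that CAD_INCIDENT IDs correspond to distinct events (each ID having a unique location and time) could be approximated as in the sketch below. It is an illustration in Python, not the project's code; the function name and the "location" and "time" column names are hypothetical stand-ins for the raw CAD fields.

```python
from collections import defaultdict
from typing import Dict, Iterable


def share_with_unique_place_and_time(rows: Iterable[Dict[str, str]]) -> float:
    """Fraction of CAD_INCIDENT IDs whose rows all share one (location, time)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["CAD_INCIDENT"]].add((row["location"], row["time"]))
    if not seen:
        return 0.0
    consistent = sum(1 for pairs in seen.values() if len(pairs) == 1)
    return consistent / len(seen)
```

A value above 0.999 on the deduplicated CAD rows would be consistent with the figure reported in the notes above.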

Camden, NJ

2013-05-01 to 2018-06-13

feature coverage rate
date 99.9%
time 99.9%
location 98.7%
lat 98.1%
lng 98.1%
subject_age 98.9%
subject_dob 98.8%
subject_race 99.0%
subject_sex 99.9%
officer_last_name 100.0%
unit 42.7%
type 100.0%
disposition 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 47.2%
vehicle_color 71.6%
vehicle_make 71.7%
vehicle_model 70.9%
vehicle_registration_state 74.8%
vehicle_year 70.7%
raw_race 98.9%
raw_ethnicity 96.5%

Data notes:

  • Data is deduplicated on case_number, Incident Datetime, IncidentLocation, OfficerName, SubjectGender, Race, Ethnicity, DateOfBirth, VehicleYear, Color, Make, and Model, reducing the number of records by ~5.4%
  • Data does not contain search/contraband fields
  • There are 3 CFS_Codes, TRAFFIC STOP, PEDESTRIAN STOP, and freeform text, which is classified as vehicular since most reference a driver or traffic stop situation
  • It appears as though Camden police often classify Hispanics as white, since the stop rate for whites is extremely high and there are no stops for Hispanics
  • According to the PD, a "summons" is a citation, so that corresponds to citation_issued in this data
  • outcomes are based on the disposition column
  • subject_race is based on Race and Ethnicity, which are passed through as raw_race and raw_ethnicity

Henderson, NV

2011-06-30 to 2018-01-31

feature coverage rate
date 100.0%
time 99.8%
location 100.0%
lat 98.4%
lng 98.4%
subject_age 99.0%
subject_dob 98.9%
subject_race 97.4%
subject_sex 98.1%
officer_id 100.0%
officer_id_hash 100.0%
type 100.0%
violation 96.6%
citation_issued 100.0%
outcome 100.0%
vehicle_color 95.5%
vehicle_make 96.1%
vehicle_type 85.5%
vehicle_registration_state 96.4%
raw_race 97.4%

Data notes:

  • Data is deduplicated on raw columns location, city, state, zip, off_dt, off_ti, dob, ht, sex, wt, eye, hair, make, ofcr_id, reducing the total number of records by ~2.1%
  • violation is a concatenation of offense_1 and offense_2 in the original data, separated by "|"
  • Missing reason_for_stop/search/contraband information
  • 2012 has little or no data for July, August, and September; we have an outstanding inquiry as to why
  • 2018 only has partial data
  • Data before 2011 is filtered out since 2010 data is so sparse that it appears to be a recording error
  • One of the files, Traffic Stops 01-01-11 to 05-30-18.xlsx, came corrupted; we are attempting to get a clean copy of this
  • We assume these are all citations since the primary raw key appears to be 'cite', although we have an outstanding inquiry to confirm this
  • subject_race is based on raw column race, which is passed through as raw_race
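
The construction of violation from offense_1 and offense_2 with a "|" separator can be sketched as follows. This Python snippet is illustrative only; the function name is hypothetical, and skipping missing or empty offenses (rather than emitting an empty segment) is an assumption.

```python
from typing import Optional


def build_violation(offense_1: Optional[str],
                    offense_2: Optional[str]) -> Optional[str]:
    """Join the non-missing offenses with "|"; return None if both are missing."""
    parts = [p.strip() for p in (offense_1, offense_2) if p and p.strip()]
    return "|".join(parts) if parts else None
```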

Statewide, NV

2012-02-14 to 2016-05-31

feature coverage rate
date 100.0%
subject_age 91.8%
subject_race 99.9%
type 100.0%
violation 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 100.0%
raw_Race 99.9%

Data notes:

  • Nevada does not seem to record Ethnicity or have any records of Hispanic drivers, so we exclude it from our analysis.
  • Nevada does not record time of stop, making it ineligible for VOD analysis.
  • The violation field is a concatenation of two fields in the raw data: infraction codes and offense description.
  • Additional columns in the raw data that may be of interest: Citation Number.

Statewide, NY

2010-01-01 to 2017-12-14

feature coverage rate
date 100.0%
time 100.0%
location 88.8%
county_name 100.0%
subject_age 100.0%
subject_race 100.0%
subject_sex 100.0%
type 100.0%
violation 100.0%
speed 33.4%
posted_speed 33.4%
vehicle_color 99.3%
vehicle_make 99.9%
vehicle_model 0.0%
vehicle_type 100.0%
vehicle_registration_state 97.0%
vehicle_year 98.7%
raw_RACE 100.0%

Data notes:

  • The data include only citations.
  • There is no data on searches.
  • The data stops at 2017-12-14.
  • subject_race is mapped from a raw data column which was passed through as raw_RACE.
  • location is simply a concatenation of three raw data columns: VIO_STREET, HWY_NUM, HWY_TYPE.
  • Additional columns in the raw data that may be of interest: LAW_SECTION, and DCJS_CODE (we do, however, provide violation in the clean data, which is called LAW_DESCRIPTION in the raw data, and appears to simply be the human readable description of LAW_SECTION and DCJS_CODE).

Albany, NY

2008-01-01 to 2017-12-30

feature coverage rate
date 100.0%
time 99.6%
location 90.3%
lat 90.2%
lng 90.2%
subject_age 100.0%
subject_dob 99.9%
subject_race 67.8%
subject_sex 100.0%
type 100.0%
violation 99.5%
vehicle_color 98.7%
vehicle_make 99.2%
vehicle_registration_state 99.4%
vehicle_year 98.8%
raw_race 67.8%

Data notes:

  • Data is deduplicated on incident, mapinfo_lo, date, dob, sex, and race, reducing the number of records by ~28%
  • Search/contraband information is missing, as well as outcomes
  • subject_race is based on the raw column race, which is passed through as raw_race
  • violation represents raw column crime_code_A, which is a description of alphanumeric crime_code column

Columbus, OH

2012-01-01 to 2016-12-30

feature coverage rate
date 100.0%
time 99.9%
location 100.0%
lat 91.8%
lng 91.8%
precinct 88.2%
zone 88.2%
subject_race 100.0%
subject_sex 100.0%
type 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 100.0%
search_conducted 100.0%
reason_for_stop 100.0%
raw_enforcement_taken 100.0%

Data notes:

  • Incident Number in the original data seems unreliable as it has several hundred entries for 9999 and 99999; furthermore, occasionally, it does appear to reference the same incident, but is duplicated for every distinct action taken against the subject
  • The raw data is deduplicated on Stop Date, Contact End Date, Ethnicity, Gender, ViolationStreet, and ViolationCrossStreet, reducing the number of records by ~15.8%
  • search_conducted and outcome are based on Enforcement Taken, which is passed through as raw_enforcement_taken

Statewide, OH

2010-01-01 to 2017-12-31

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 100.0%
lng 100.0%
county_name 99.9%
subject_race 91.3%
subject_sex 91.3%
officer_id 100.0%
officer_id_hash 100.0%
department_name 100.0%
type 100.0%
violation 49.3%
arrest_made 100.0%
warning_issued 100.0%
outcome 39.8%
contraband_found 19.8%
contraband_drugs 100.0%
search_conducted 100.0%
search_basis 100.0%
raw_DISP_STRING 93.4%
raw_ORC_STRING 47.3%
raw_DISPOSITIONS 6.6%
raw_race 91.3%

Data notes:

  • The stop_purpose field is populated by infraction codes. The corresponding laws can be read here.
  • There is no data for contraband being found, but a related field could potentially be reconstructed by looking at searches involving drugs and an arrest. We mark contraband_found as TRUE for drug-related arrests (extracted from raw_ORC_STRING), but we cannot determine whether the remainder are FALSE or whether some other type of contraband was recovered.
  • Counties were mapped using the provided dictionary, which is included in the raw data folder.
  • We cannot find disposition codes (in DISP_STRING) which clearly indicate whether a citation as opposed to a warning was given, although there is a disposition for warnings.
  • The data contains stops of both type TS and TSA, standing for "traffic stop" and "traffic stop additional". The latter have a higher search rate and tend to have additional information (i.e., ASINC_STRING is not NA). We include both types in analysis, as they do not appear to be duplicates (addresses and times do not match) and we do not have a clear reason to exclude either.
  • While there is data on search types, they only include consent and K9 searches, suggesting a potential difference in recording policy (many other states have probable cause searches and incident to arrest searches, for example).
  • officer_id refers to a single officer throughout their tenure on the state patrol, but it is re-assigned to a new trooper upon an officer's retirement.
  • raw_DISP_STRING is used to determine subject race, sex, stop outcome, and search information. See processing script for mappings.
  • Violations were mapped from raw_ORC_STRING.
  • 2017 data has a slightly different format: information from DISP_STRING and ORC_STRING exists in raw_DISPOSITIONS for that year.
  • Additional columns from raw data that may be of interest: ASINC_STRING
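
The contraband rule described in the notes above can be sketched as follows. This is a minimal illustration in Python, not the project's processing script. The function name, the comma delimiter inside the ORC string, and the use of Ohio Revised Code chapter 2925 as the marker for drug offenses are all assumptions made for the example. Non-drug stops return None (unknown) rather than False, matching the note that the remainder cannot be determined.

```python
from typing import Optional

# Assumption for illustration: drug offenses are identified by an ORC chapter
# prefix. The real mapping lives in the project's processing script.
DRUG_CODE_PREFIXES = ("2925",)


def contraband_found(orc_string: Optional[str], arrest_made: bool) -> Optional[bool]:
    """Return True for drug-related arrests and None (unknown) otherwise."""
    if orc_string is None or not arrest_made:
        return None
    # Assumption: multiple codes in the raw string are comma-separated.
    codes = [code.strip() for code in orc_string.split(",")]
    if any(code.startswith(DRUG_CODE_PREFIXES) for code in codes):
        return True
    # Never False: a non-drug arrest may still have involved other contraband.
    return None
```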

Cincinnati, OH

2009-01-01 to 2018-05-28

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 99.9%
lng 99.9%
neighborhood 25.6%
beat 24.1%
subject_race 100.0%
subject_sex 99.9%
officer_assignment 98.5%
type 100.0%
disposition 25.3%
arrest_made 99.9%
citation_issued 99.9%
warning_issued 99.9%
outcome 80.9%
reason_for_stop 12.3%
vehicle_make 99.3%
vehicle_model 99.0%
vehicle_registration_state 98.5%
vehicle_year 99.1%
raw_race 100.0%
raw_action_taken_cid 99.9%
raw_field_subject_cid 100.0%

Data notes:

  • Data filters out passengers and records where sex is "NON-PERSON" (i.e. business)
  • Data is deduplicated on instance_id, interview_date, address_x, sex, race, and age_range_cid, which reduces the number of rows by ~56%
  • Addresses are "sanitized", i.e. 1823 Field St. -> 18XX Field St.; since 83% of given geocodes in the raw data are null, we replace X with 0 to get approximate geocoding locations
  • Data before 2009 is removed because it is too sparse to be trusted; 2018 has only partial data
  • reason_for_stop represents incident_type_desc in the raw data
  • outcomes are based on raw column actiontakencid, which is passed through as raw_action_taken_cid
  • type is based on field_subject_cid, which is passed through as raw_field_subject_cid
  • subject_race is based on race, which is passed through as raw_race
  • There are zero stops of Hispanic individuals reported after 2010.
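
The address handling described above (replacing each X in a sanitized house number with 0 before geocoding) can be sketched as below. This is an illustrative Python version under stated assumptions: the function name is hypothetical, and the sketch assumes the masked characters appear only at the end of a leading house number.

```python
import re


def desanitize_address(address: str) -> str:
    """Replace the X placeholders in a leading house number with zeros."""
    # Match optional leading digits followed by one or more X characters that
    # end at a word boundary, so street names starting with X are untouched.
    match = re.match(r"^(\d*)(X+)\b", address)
    if match is None:
        return address
    digits, masked = match.groups()
    return digits + "0" * len(masked) + address[match.end():]
```

The result is only an approximate location (the start of the block), which is why the notes describe the geocodes as approximate.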

Oklahoma City, OK

2011-01-01 to 2017-10-18

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 89.9%
lng 89.9%
beat 85.0%
division 85.0%
sector 85.0%
subject_age 99.6%
subject_dob 99.5%
subject_race 99.8%
subject_sex 99.7%
officer_id 100.0%
officer_id_hash 100.0%
type 79.0%
violation 100.0%
citation_issued 100.0%
outcome 100.0%
speed 40.8%
posted_speed 40.8%
vehicle_color 86.3%
vehicle_make 86.2%
vehicle_model 46.2%
vehicle_registration_state 82.7%
vehicle_year 77.7%
raw_dfnd_race 99.8%

Data notes:

  • Data is deduplicated on raw columns violDate, violTime, violLocation, DfndRace, DfndSex, and DfndDOB, reducing the number of records by ~15.7%
  • Partial data from before 2011 is filtered out, although early 2011 still seems to have missing/partial data; the last few months of 2017 are also missing
  • Search/contraband information is missing
  • subject_race is based on DfndRace, which is passed through as raw_dfnd_race; the data do not include classification of drivers as Hispanic.
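
Deduplicating on a set of raw columns, as in the first note above, can be sketched like this. It is a minimal Python illustration, not the project's code: the function name and the choice to keep the first row of each group are assumptions, and the actual pipeline may merge the values of duplicate rows instead.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]

# Raw columns named in the Oklahoma City note.
OKC_KEY_COLUMNS = ("violDate", "violTime", "violLocation", "DfndRace", "DfndSex", "DfndDOB")


def deduplicate(rows: List[Row], key_columns: Sequence[str]) -> Tuple[List[Row], float]:
    """Keep the first row per key and report the fraction of rows removed."""
    seen = set()
    unique: List[Row] = []
    for row in rows:
        key = tuple(row.get(column) for column in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    reduction = 1 - len(unique) / len(rows) if rows else 0.0
    return unique, reduction
```

The returned fraction corresponds to the "reducing the number of records by ~X%" figures quoted throughout this readme.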

Tulsa, OK

2009-01-01 to 2016-12-31

feature coverage rate
date 100.0%
time 100.0%
location 99.0%
lat 90.6%
lng 90.6%
division 100.0%
subject_race 99.0%
subject_sex 99.2%
type 70.6%
violation 100.0%
speed 38.6%
posted_speed 41.0%
vehicle_color 91.9%
vehicle_make 94.0%
vehicle_model 83.2%
vehicle_registration_state 93.5%
vehicle_year 93.3%
raw_race 99.0%

Data notes:

  • Data is deduplicated on raw columns violationdate, violation_location, officerdiv, race, and sex, reducing the number of records by ~30.0%
  • Data is all citations
  • Data appears to be all vehicular, although the PD hasn't confirmed that yet
  • subject_race is based on raw column race, which is passed through as raw_race

Statewide, OR

2010-01-01 to 2014-01-01

feature coverage rate
date 92.6%
subject_race 100.0%
type 100.0%
raw_Race 100.0%

Data notes:

  • There is basically no data, including no data on Hispanic drivers, so we exclude Oregon from our analysis.
  • Counts for 2015 and 2016 are much lower than in earlier years.
  • subject_race is mapped from raw_Race

Philadelphia, PA

2014-01-01 to 2018-04-14

feature coverage rate
date 100.0%
time 100.0%
location 98.0%
lat 94.4%
lng 94.4%
district 100.0%
service_area 100.0%
subject_age 99.8%
subject_race 100.0%
subject_sex 100.0%
type 100.0%
arrest_made 100.0%
outcome 5.1%
contraband_found 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
raw_race 100.0%
raw_individual_contraband 100.0%
raw_vehicle_contraband 100.0%

Data notes:

  • Data is deduplicated on raw columns datetimeoccur, location, districtoccur, lat, lng, gender, age, race, stoptype, individual_frisked, individual_searched, individual_arrested, individual_contraband, vehicle_frisked, vehicle_searched, vehicle_contraband, reducing the number of records by ~1.4%
  • Information on citations and warnings is missing, but arrests are included
  • search_person and search_vehicle correspond to raw columns individual_searched and vehicle_searched; we filled in false for NA values under the assumption that unrecorded search data represented the absence of a search
  • contraband_found is based on raw columns individual_contraband and vehicle_contraband, which are passed through as raw_*; if both of these were null and search_conducted was true, contraband_found was set to false
  • subject_race is based on the raw column race, which is passed through as raw_race
  • 2018 has only partial data, and it appears to be the same for early 2014
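
The contraband_found rule for Philadelphia can be sketched as below. This is an illustrative Python version under assumptions: the function name is hypothetical, combining the two raw columns with a logical OR is assumed, and the case where both raw values are missing and no search occurred is left as None here because the notes do not specify it.

```python
from typing import Optional


def contraband_found(
    individual_contraband: Optional[bool],
    vehicle_contraband: Optional[bool],
    search_conducted: bool,
) -> Optional[bool]:
    """Combine the two raw contraband flags into one canonical value."""
    if individual_contraband is None and vehicle_contraband is None:
        # Both raw values missing: treated as "nothing found" only if a search
        # actually happened (assumption: otherwise unknown).
        return False if search_conducted else None
    # Assumption: contraband on either the person or the vehicle counts.
    return bool(individual_contraband) or bool(vehicle_contraband)
```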

Pittsburgh, PA

2008-01-01 to 2018-04-29

feature coverage rate
date 99.9%
time 99.9%
location 100.0%
lat 97.7%
lng 97.7%
neighborhood 82.7%
subject_age 17.1%
subject_race 88.6%
subject_sex 96.2%
officer_id 100.0%
officer_id_hash 100.0%
officer_age 92.0%
officer_race 77.7%
officer_sex 78.1%
type 100.0%
violation 82.7%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 88.6%
contraband_found 86.8%
frisk_performed 82.7%
search_conducted 100.0%
reason_for_stop 17.3%
raw_zone 8.7%
raw_object_searched 12.5%
raw_race 100.0%
raw_ethnicity 16.4%
raw_zone_division 17.3%
raw_evidence_found 2.0%
raw_weapons_found 0.4%
raw_nothing_found 79.5%
raw_police_zone 82.7%
raw_officer_race 78.1%
raw_officer_zone 82.7%

Data notes:

  • The raw data for pedestrian stops actually has many cities in it, but here we filter to only Pittsburgh; vehicular stops do not have an associated city, and so are assumed to be only Pittsburgh
  • Raw data for vehicle stops has stop end time as well
  • There are instances when evidencefound is true but contrabandfound is NA, so we have an outstanding inquiry as to what evidencefound refers to; similarly, weaponsfound is sometimes true when contrabandfound is false and vice versa, so it's unclear whether the contraband is weapons or not; for now we leave out contraband_weapons and have another outstanding inquiry
  • If a search was conducted and the stop type was vehicular (pedestrian stops don't provide search outcomes) and contrabandfound was NA, we set contraband_found to false; otherwise we use the value in the contrabandfound field. We do this under the assumption that false and NA for contraband_found are equivalent when a search occurred, i.e. an officer conducted a search and either found nothing or recorded nothing
  • search_conducted is true when any one of objectsearched (pedestrian stops), contrabandfound, evidencefound, weaponsfound, and nothingfound (vehicular stops) is not NA; all these are passed on as raw_*
  • Sex and gender do not match 73% of the time in pedestrian data, and race and ethnicity mismatch often as well. In both cases, if sex != gender or race != ethnicity, we set the value to NA, otherwise we coalesce(sex, gender) or coalesce(race, ethnicity) [this keeps values when one is NA but the other isn't]; we pass through all the raw values as raw_*
  • There are 4 zone-related columns in the raw data: zone, zone_division, policezone, and officerzone; we pass them through as raw_*
  • The data is deduplicated on raw columns stop_date, stopstart, stopend, address, officer_id, and person_id, reducing the number of rows by ~21.1%
  • violation represents raw column crimedescription
  • 2008 and early 2009 appear to have partial data and 2018 only has the first 4 months
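
The sex/gender and race/ethnicity reconciliation described above (NA on disagreement, otherwise coalesce) can be sketched as follows. This is a minimal Python illustration with a hypothetical function name; None stands in for NA.

```python
from typing import Optional


def reconcile(first: Optional[str], second: Optional[str]) -> Optional[str]:
    """Return None when both values exist and disagree, else the one present."""
    if first is not None and second is not None and first != second:
        return None
    # Coalesce: keeps a value when one side is missing but the other is not.
    return first if first is not None else second
```

For example, reconcile("male", None) keeps "male", while reconcile("male", "female") yields None.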

Statewide, RI

2005-01-02 to 2015-12-31

feature coverage rate
date 100.0%
time 100.0%
zone 100.0%
subject_yob 94.3%
subject_race 94.3%
subject_sex 94.3%
department_id 100.0%
type 100.0%
arrest_made 94.3%
citation_issued 94.3%
warning_issued 94.3%
outcome 93.0%
contraband_found 100.0%
contraband_drugs 73.0%
contraband_weapons 9.3%
contraband_alcohol 0.2%
contraband_other 3.5%
frisk_performed 100.0%
search_conducted 100.0%
search_basis 100.0%
reason_for_search 100.0%
reason_for_stop 94.3%
vehicle_make 62.4%
vehicle_model 45.1%
raw_BasisForStop 94.3%
raw_OperatorRace 94.3%
raw_OperatorSex 94.3%
raw_ResultOfStop 94.3%
raw_SearchResultOne 3.5%
raw_SearchResultTwo 0.2%
raw_SearchResultThree 0.0%

Data notes:

  • The stops are mapped to state patrol zones, which represent police barrack jurisdiction areas. However, there is no simple mapping between zones and counties. We store state patrol zones in the district column and use this column in our granular location analyses.
  • contraband information was mapped from raw_SearchResult[One/Two/Three].
  • Column search_basis is a standardized version of reason_for_search. If multiple reasons are provided, it uses the hierarchy plain view, probable cause, other; if no search reason is given, we default to probable cause. Note that while the raw data contains a ConsentRequested column, we have no information about whether consent was given.
  • Additional columns in the raw data that may be of interest: SearchFrisk[One/Two/Three] (says whether searches and frisks were of the driver, passenger, or vehicle), Duration (A/B/C/NA), AdditionalOccupants, Road (I/S/N/NA), PlateType, PriorRecord (Y/N/T/NA), ConsentRequested.
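
The search_basis hierarchy described above can be sketched as below. This is an illustrative Python version under assumptions: the function name is hypothetical, the raw reasons are assumed to be already mapped to the three category labels, and any unrecognized label is treated as "other".

```python
from typing import Iterable


def search_basis(reasons: Iterable[str]) -> str:
    """Pick one basis using the order: plain view, probable cause, other."""
    reasons = set(reasons)
    if not reasons:
        # No search reason given: default to probable cause.
        return "probable cause"
    for basis in ("plain view", "probable cause"):
        if basis in reasons:
            return basis
    return "other"
```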

Statewide, SC

2005-01-01 to 2016-12-31

feature coverage rate
date 100.0%
location 100.0%
lat 23.4%
lng 23.4%
county_name 100.0%
subject_age 100.0%
subject_race 100.0%
subject_sex 100.0%
officer_id 98.5%
officer_id_hash 98.5%
officer_age 99.5%
officer_race 100.0%
officer_last_name 100.0%
department_id 100.0%
type 100.0%
violation 66.3%
arrest_made 100.0%
citation_issued 100.0%
outcome 66.3%
contraband_found 100.0%
contraband_drugs 99.2%
contraband_weapons 99.2%
contraband_alcohol 0.0%
contraband_other 2.1%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
reason_for_stop 100.0%
raw_contact_type 100.0%
raw_sex 100.0%
raw_race 100.0%
raw_sectionnum 66.3%
raw_offensecode 66.3%
raw_contrabanddesc 0.7%
raw_officer_race 100.0%

Data notes:

  • The police_department field is populated by state patrol agency.
  • More data on local stops is available here. It is aggregated by race and age group — potentially scrapable if useful.
  • While there is data on violation, many of the stops have missing data.
  • Violation is a concatenation of sectionnum and offensecode in the raw data.
  • Additional columns in the raw data that may be of interest (Note, many of these were used to construct search/contraband/arrest/outcome information in the clean data. See processing script for details.): jailed, felonyarrest, armedwith (messy free field), using[drugs/alcohol], contraband[drugs/drugparaphenalia/weapons/other] (sic), [passenger/subject/vehicle]searched.

Statewide, SD

2012-01-01 to 2016-02-29

feature coverage rate
date 100.0%
time 99.9%
location 16.6%
county_name 99.8%
subject_sex 99.2%
type 100.0%
violation 98.9%
citation_issued 100.0%
warning_issued 100.0%
outcome 100.0%
vehicle_color 76.0%
vehicle_make 92.4%
vehicle_model 79.6%
vehicle_registration_state 98.2%
vehicle_year 77.2%

Data notes:

  • Race data is missing, so we exclude South Dakota from our analysis.
  • Some county names were misrecorded.
  • Additional columns in raw data that may be of interest: Eye Color, Insurance, Commerical Vehicle (sic), Is Accident, Haz Mat Vehicle.

Statewide, TN

1971-01-05 to 2016-06-26

feature coverage rate
date 100.0%
time 100.0%
location 71.0%
county_name 99.1%
subject_race 99.2%
subject_sex 99.8%
department_id 100.0%
department_name 100.0%
type 100.0%
violation 92.3%
citation_issued 100.0%
outcome 100.0%
vehicle_make 99.4%
vehicle_model 95.5%
vehicle_year 94.5%
raw_ORIG_TRFC_VIOL_CDE 100.0%
raw_CNTY_NBR 100.0%
raw_RACE_IND 99.2%
raw_SEX_IND 99.8%

Data notes:

  • The data contain only citations.
  • The codes in the CNTY_NBR field represent counties ordered alphabetically.
  • location is a concatenation of raw fields UP_STR_HWY (highway/street) and MLE_MRK_NBR (mile marker). It would be possible to map the highway and mile marker data to geo coordinates, as we did in Washington. However, since we are often missing mile marker or even mile marker and highway, we did not do so (as most would be NA).
  • raw_ORIG_TRFC_VIOL_CDE maps to violation, raw_CNTY_NBR maps to county_name, raw_RACE_IND maps to subject_race, raw_SEX_IND maps to subject_sex.
  • Additional raw data columns that may be of interest: SPEED, SPEED_LMT, TN_RSDNT_IND (resident boolean), HZRD_MTRL_IND (hazardous material boolean), MTR_CYCL_IND (motorcycle boolean), CNSTR_ZNE (construction zone boolean), WRKR_PRSNT (worker present in construction zone boolean), TRVL_DRCT (travel direction), ACCD_IND (accident boolean), CMV_IND (commercial vehicle boolean)

Nashville, TN

2010-01-01 to 2019-03-24

feature coverage rate
date 100.0%
time 99.8%
location 100.0%
lat 94.0%
lng 94.0%
precinct 87.4%
reporting_area 89.2%
zone 87.4%
subject_age 100.0%
subject_race 99.9%
subject_sex 99.6%
officer_id 100.0%
officer_id_hash 100.0%
type 100.0%
violation 99.7%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 99.9%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
search_basis 100.0%
reason_for_stop 99.7%
vehicle_registration_state 99.0%
notes 16.6%
raw_verbal_warning_issued 100.0%
raw_written_warning_issued 84.0%
raw_traffic_citation_issued 100.0%
raw_misd_state_citation_issued 77.6%
raw_suspect_ethnicity 100.0%
raw_driver_searched 100.0%
raw_passenger_searched 100.0%
raw_search_consent 100.0%
raw_search_arrest 100.0%
raw_search_warrant 100.0%
raw_search_inventory 100.0%
raw_search_plain_view 100.0%

Data notes:

  • Data is deduplicated on raw columns stop_date_time, stop_location_street, officer_employee_number, race, sex, and age_of_suspect, reducing the number of records by ~0.3%
  • In 30 cases (of ~2.6M records), search_conducted is ambiguous after the merge and is left as NA: being NA after the above merge indicates that there were two distinct values for raw column searchoccur, so it's unclear whether the value should be true or false
  • reason_for_stop and violation are both translations of the original stop_type column; this column is sometimes the pretextual reason for the stop and does not always represent what the individual was ultimately cited for
  • contraband_drugs is raw column drugs_seized, contraband_weapons is weapons_seized, and contraband_found is evidenceseized
  • citation_issued is derived from traffic_citation_issued and misd_state_citation_issued, which are passed through as raw_*; misd_state_citation_issued is sometimes NA, so for the purposes of defining citation_issued, we consider NA to be false
  • warning_issued is derived from verbal_warning_issued and written_warning_issued, which are passed through as raw_*; written_warning_issued is sometimes NA, so for the purposes of defining warning_issued, we consider NA to be false
  • search_basis is based on the raw columns search_plain_view, search_consent, search_incident_to_arrest, search_warrant, and search_inventory, which are all passed on with the raw_* prefix
  • subject_race is derived from raw columns suspect_ethnicity and suspect_race, which are passed through with the raw_* prefix
  • search_person is derived from search_driver and search_passenger, which are passed through with the raw_* prefix
  • When contraband_found is NA, we fill it with false when a search occurred, under the assumption that the officer simply didn't record the absence of contraband
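
The way citation_issued and warning_issued treat NA as false can be sketched as follows. This is a minimal Python illustration; the function names are hypothetical and None stands in for NA.

```python
from typing import Optional


def any_true_na_as_false(*flags: Optional[bool]) -> bool:
    """Return True if any flag is True, counting None (NA) as False."""
    return any(flag is True for flag in flags)


def citation_issued(traffic_citation: Optional[bool], misd_state_citation: Optional[bool]) -> bool:
    # misd_state_citation_issued is sometimes NA in the raw data.
    return any_true_na_as_false(traffic_citation, misd_state_citation)


def warning_issued(verbal_warning: Optional[bool], written_warning: Optional[bool]) -> bool:
    # written_warning_issued is sometimes NA in the raw data.
    return any_true_na_as_false(verbal_warning, written_warning)
```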

Arlington, TX

2016-01-01 to 2016-12-31

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 99.8%
lng 99.8%
beat 99.0%
district 99.0%
sector 98.1%
subject_race 100.0%
subject_sex 100.0%
officer_id 99.9%
officer_id_hash 99.9%
type 100.0%
outcome 0.0%
search_conducted 100.0%
reason_for_stop 100.0%
raw_1st_digit_race 100.0%
raw_4th_digit_final_outcome 100.0%
raw_6th_digit_search_outcome 100.0%

Data notes:

  • Unclear what PRA, xCoordinate, and yCoordinate are in the raw data
  • Missing data dictionaries for reason_for_stop, outcome, and search_outcome; the latter two are passed through as raw_*
  • subject_race is based on raw column 1st digit (Race), which is passed through as raw_1st_digit_race
  • Only 2016 data was provided

Austin, TX

2006-01-01 to 2016-06-30

feature coverage rate
date 100.0%
subject_age 99.4%
subject_race 100.0%
subject_sex 99.9%
officer_id 100.0%
officer_id_hash 100.0%
type 77.2%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
search_basis 100.0%
reason_for_stop 100.0%
vehicle_make 97.4%
vehicle_model 31.0%
vehicle_registration_state 98.5%
vehicle_year 77.1%
raw_ethnicity 66.6%
raw_person_search_search_based_on 3.6%
raw_person_search_search_discovered 4.1%
raw_person_searched 89.8%
raw_vehicle_search_search_based_on 2.0%
raw_vehicle_search_search_discovered 2.4%
raw_vehicle_searched 83.4%
raw_race_description 100.0%
raw_street_check_description 100.0%

Data notes:

  • Data is deduplicated on raw columns street_check_case_number, occurred_date, officer, sex, race, ethnicity, yob, veh_type, veh_year, veh_make, veh_model, veh_style, and soi, reducing the number of rows by ~0.5%
  • Data does not include location or outcomes
  • There are no clear pedestrian-only discretionary stops in reason_checked_description; SUSPICIOUS PERSON / VEHICLE is one category in reason_for_stop, but is included with "vehicular" stops; as such, it may over count vehicular stops
  • reason_for_stop represents raw column reason_checked_description
  • search_person and search_vehicle represent person_searched and vehicle_searched in the raw data, which are passed through with raw_*; for the canonical columns search_person and search_vehicle, NA values are changed to false under the assumption that the absence of a search may not always be recorded
  • frisk_performed is based on person_search_search_based_on, and is false when that column is NA, on the assumption that the officer did not record the absence of a frisk
  • search_basis is derived from person_search_search_based_on and vehicle_search_search_based_on, which are passed through with the raw_* prefix
  • contraband_{found,drugs,weapons} are derived from person_search_search_discovered and vehicle_search_search_discovered, which are passed through with raw_* prefix; when these values are NA, they are assumed to be FALSE for contraband discovery
  • reason_for_stop represents the raw column reason_checked_description; the raw column street_check_description also seems to provide information, so it is passed through with the raw_ prefix
  • subject_race is based on raw columns race and ethnicity; there is also a raw race_description column, which is passed through with the raw_ prefix (instead of the race column, since it is just a nicer translation of the single characters in race); ethnicity is also passed through with the raw_ prefix

Garland, TX

2012-01-03 to 2019-06-22

feature coverage rate
date 100.0%
time 99.9%
location 0.0%
subject_race 100.0%
subject_sex 100.0%
officer_id 100.0%
officer_id_hash 100.0%
officer_race 34.1%
officer_sex 97.4%
officer_first_name 99.9%
officer_last_name 100.0%
type 100.0%
disposition 100.0%
violation 100.0%
citation_issued 100.0%
outcome 100.0%
speed 49.8%
posted_speed 49.8%
vehicle_color 98.6%
vehicle_make 99.5%
vehicle_registration_state 99.2%
vehicle_year 58.4%
raw_race 100.0%
raw_alleged_speed 81.8%
raw_posted_speed 81.8%

Data notes:

  • Data is deduplicated on raw columns sex, race, vehicle_year, vehicle_color, make, vehicle_state, incident_date, incident_time, and officer_badge, reducing the number of records by ~33.1%
  • incident_address (location in clean) is 100% null; we have an outstanding inquiry here
  • We assume these are all citations since they appear to be indexed by ticket number, but we have an outstanding task to clarify this
  • violation represents offense_title in the raw data
  • Data is lacking reason_for_stop/search/contraband information
  • officer_race is mostly NA or "U", the remainder are white or Asian/Pacific Islander, so this data is probably unreliable
  • subject_race is based on raw column race, which is passed through with the raw_ prefix
  • Sometimes the same stop has different speeds recorded: often one row has a pair of legitimate values, e.g. going 55 in a 40, while the others have 0 and 0 or NA and NA, possibly because multiple tickets were issued for the same stop. For each record, we take the max of each speed column to represent the speeds; raw_alleged_speed and raw_posted_speed are passed through. When the values were 0 or -Inf, we set them to NA under the assumption that this was a stop unrelated to speed
  • 2012 and 2018 have only partial data
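
The speed handling described above (take the maximum over the rows of a stop, then blank out 0 and -Inf) can be sketched as below. This is an illustrative Python version with a hypothetical function name. The -Inf case is modeled as the maximum over no non-missing values, which is presumably how it arises in the original processing.

```python
from typing import Optional, Sequence


def collapse_speed(values: Sequence[Optional[float]]) -> Optional[float]:
    """Collapse the speeds recorded for one stop into a single value."""
    present = [value for value in values if value is not None]
    # With no non-missing values the maximum is -inf.
    top = max(present, default=float("-inf"))
    if top == 0 or top == float("-inf"):
        # Assumed to be a stop unrelated to speed.
        return None
    return top
```

The same function would be applied separately to the alleged speed and the posted speed of each deduplicated record.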

Houston, TX

2014-01-01 to 2018-04-08

feature coverage rate
date 100.0%
location 92.1%
lat 91.7%
lng 91.7%
beat 86.5%
district 86.5%
subject_race 82.9%
subject_sex 99.6%
type 100.0%
violation 100.0%
citation_issued 100.0%
outcome 100.0%
speed 29.9%
posted_speed 30.9%
vehicle_color 96.3%
vehicle_make 98.4%
vehicle_model 96.8%
raw_race 82.9%

Data notes:

  • Data is deduplicated on raw columns Defendant Name, Gender, Race, Street, Block, Scnd Street, Scnd Block, Officer Name, and Offense Date, reducing the number of records by ~0.02%; there is a possibility this over collapses rows in the case where an officer pulls over the same person twice in the same day at the same location
  • Data is lacking search/contraband information
  • Data consists only of citations
  • When speed and posted_speed were 0, we set them to NA, under the assumption that this was a default value and the stop was unrelated to speed
  • subject_race is based on the raw column Race, passed through as raw_race

Lubbock, TX

2008-05-01 to 2018-04-30

feature coverage rate
date 100.0%
location 100.0%
lat 99.7%
lng 99.7%
officer_first_name 71.0%
officer_last_name 99.0%
type 100.0%
disposition 99.6%
citation_issued 99.6%
warning_issued 99.6%
outcome 82.0%

Data notes:

  • Insufficient information here to deduplicate records, if there are duplicates
  • Missing reason_for_stop/search/contraband/subject_sex/subject_race data
  • There is an outstanding ask for a data dictionary for the disposition codes

Plano, TX

2012-01-01 to 2015-12-31

feature coverage rate
date 100.0%
time 99.4%
location 49.7%
lat 48.6%
lng 48.6%
beat 46.6%
sector 46.6%
subject_age 0.6%
subject_race 100.0%
subject_sex 100.0%
officer_id 23.1%
officer_id_hash 23.1%
officer_last_name 48.5%
unit 23.1%
type 100.0%
violation 100.0%
arrest_made 98.0%
citation_issued 85.9%
warning_issued 31.9%
outcome 98.2%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
search_conducted 100.0%
search_basis 100.0%
speed 12.6%
posted_speed 13.3%
vehicle_color 22.9%
vehicle_make 23.0%
vehicle_model 22.5%
vehicle_type 21.9%
notes 22.8%
raw_race 100.0%
raw_contraband 21.5%
raw_contraband_found 9.1%
raw_results 74.4%
raw_ethnicity 100.0%

Data notes:

  • Data is rather messy from year to year with different columns, and files with "all_traffic_stops" in the name are difficult to join into the other incident data, since the incident number in those files is populated only ~15% of the time; location data is spread across 4 columns in different files, and each is null at least 75% of the time
  • violation is a concatenation of violation_description, primary_violation, offense, and offense_[1-8], which are all null most of the time; the separator is a comma
  • Data is deduplicated on date, time, location, officer_id, subject_age, subject_race, and subject_sex, reducing the number of records by ~0.0004%, but some of this may be over-deduplication because NAs are common in location, officer_id, and subject_age
  • location is a coalesced version of the raw columns location, violation_location, offense_location, and arrest_location, all of which are ~75% null independently
  • raw_results is a concatenation of officer_result, result, and result_[1-8], separated by a comma; outcomes are based on this column, as well as the warning, citation, and citation_number columns in the raw data
  • search_conducted represents search_conducted, search_performed, and searched raw columns coalesced (they are mutually exclusive); similarly, consent in search_basis is based on search_consent, search_consent_2, and consent in the raw data, coalesced, and arrest_made is a coalesced version of arrest and arrested
  • When the contraband and contraband_found raw columns are NA, they are assumed to be false or no contraband found for the canonical contraband_found column in the clean data
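
The two column-combining operations used for Plano (coalescing mutually exclusive columns and concatenating mostly-null columns with a comma) can be sketched as follows. This is a minimal Python illustration; the function names are hypothetical, None stands in for NA, and dropping missing values before concatenating is an assumption.

```python
from typing import Optional


def coalesce(*values: Optional[str]) -> Optional[str]:
    """Return the first non-missing value, or None if all are missing."""
    for value in values:
        if value is not None:
            return value
    return None


def concatenate(*values: Optional[str], separator: str = ",") -> Optional[str]:
    """Join the non-missing values with a comma (assumption: NAs are dropped)."""
    present = [value for value in values if value is not None]
    return separator.join(present) if present else None
```

Under this sketch, location would be coalesce(location, violation_location, offense_location, arrest_location), and violation would be the concatenation of the violation and offense columns listed above.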

Statewide, TX

2006-01-01 to 2017-12-31

feature coverage rate
date 100.0%
time 100.0%
location 92.0%
lat 58.2%
lng 58.2%
county_name 100.0%
district 92.0%
precinct 32.9%
region 92.0%
subject_race 100.0%
subject_sex 100.0%
officer_id 100.0%
officer_id_hash 100.0%
officer_last_name 64.1%
type 100.0%
violation 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 100.0%
contraband_found 100.0%
contraband_drugs 100.0%
contraband_weapons 100.0%
search_conducted 92.0%
search_vehicle 90.4%
search_basis 100.0%
vehicle_color 45.2%
vehicle_make 71.9%
vehicle_model 66.3%
vehicle_type 99.9%
vehicle_year 67.1%
raw_HA_SEARCH_PC_boolean 92.0%
raw_HA_SEARCH_CONCENT_boolean 92.0%
raw_HA_INCIDTO_ARREST_boolean 92.0%
raw_HA_VEHICLE_INVENT_boolean 92.0%

Data notes:

  • There is evidence that minority drivers are labeled as white in the data. For example, see this report from KXAN. We remapped the driver race field as provided using the 2000 surnames dataset released by the U.S. Census. See the processing script or paper for details.
  • We asked whether there was a field which provided arrest data, but received no clarification. There is data on incident to arrest searches, but this does not necessarily identify all arrests.
  • Based on the provided data dictionary as well as clarification from DPS via email, we classify THP6 and TLE6 in HA_TICKET_TYPE as citations and HP3 as warnings.
  • The data only records when citations and warnings were issued, but not arrests.
  • We did not receive any search information in the 2017 data.
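
The HA_TICKET_TYPE classification described above can be sketched as a simple lookup. This is an illustrative Python version; the function name and the returned dictionary shape are assumptions, while the three codes come from the note above.

```python
from typing import Dict, Optional

CITATION_TYPES = {"THP6", "TLE6"}
WARNING_TYPES = {"HP3"}


def classify_ticket(ticket_type: Optional[str]) -> Dict[str, bool]:
    """Map an HA_TICKET_TYPE code to citation and warning flags."""
    return {
        "citation_issued": ticket_type in CITATION_TYPES,
        "warning_issued": ticket_type in WARNING_TYPES,
    }
```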

San Antonio, TX

2012-01-01 to 2018-04-19

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 99.8%
lng 99.8%
district 92.4%
substation 92.4%
subject_age 99.9%
subject_race 100.0%
subject_sex 99.7%
type 97.5%
violation 100.0%
arrest_made 100.0%
citation_issued 100.0%
outcome 100.0%
contraband_found 100.0%
search_conducted 100.0%
search_basis 100.0%
speed 52.5%
posted_speed 52.5%
vehicle_color 83.5%
vehicle_make 83.9%
vehicle_model 83.4%
vehicle_registration_state 83.7%
vehicle_year 82.5%
raw_race 100.0%
raw_posted_speed 100.0%
raw_actual_speed 100.0%
raw_search_reason 15.0%
raw_contraband_or_evidence 15.0%
raw_custodial_arrest_made 14.6%

Data notes:

  • Data is deduplicated on citation_number, reducing the number of rows by 23.3%; deduping on date, time, location, subject_race, subject_sex, and subject_age instead reduces the number of records by 25.3%, roughly 2% more than only citation number, but, curiously, there are often rows that have identical information on those columns but different recorded speeds; so, it's unclear whether these are duplicates with misentries or distinct events; we err on the side of caution and consider them distinct events; there also appear to be multiple offenses related to each citation, and those not involving speed have their speeds set to 0; accordingly, we take the maximum speed and posted speed to represent the speeds for every citation/record
  • Data consists only of arrests and citations
  • contraband_found is based on the raw column Contraband Or Evidence; when this is NA, it is set to false under the assumption that an officer may not always record the absence of contraband found; the raw column is passed through as raw_contraband_or_evidence
  • search_basis is based on the raw column Search Reason, which is passed through as raw_search_reason
  • search_conducted is false when Search Reason is NA, "No Search", or one of the ~200 entries that look like incorrect entries, i.e. A, 9, 6
  • subject_race is based on the raw column Race and is passed through as raw_race
  • arrest_made is based on raw column Custodial Arrest Made, which is passed through as raw_custodial_arrest_made; arrest_made is true when Custodial Arrest Made is true and false when it is false or NA
  • 2018 has only the first 4 months of data
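
The search_conducted rule for San Antonio can be sketched as below. This is an illustrative Python version under assumptions: the function name is hypothetical, and the "incorrect entries" are approximated as values of at most one character (such as A, 9, 6), which may not match the exact list used in the processing script.

```python
from typing import Optional


def search_conducted(search_reason: Optional[str]) -> bool:
    """Return False for missing, "No Search", or stray one-character values."""
    if search_reason is None:
        return False
    reason = search_reason.strip()
    if reason.casefold() == "no search":
        return False
    # Assumption: entries like "A", "9", "6" are data-entry errors.
    if len(reason) <= 1:
        return False
    return True
```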

Statewide, VA

2006-01-07 to 2016-04-23

feature coverage rate
date 100.0%
location 100.0%
county_name 87.3%
subject_race 100.0%
officer_id 100.0%
officer_id_hash 100.0%
officer_race 0.0%
officer_first_name 100.0%
officer_last_name 100.0%
type 100.0%
search_conducted 100.0%
raw_officer_race 100.0%
raw_race 100.0%

Data notes:

  • The original data was aggregated by week.
  • Some rows have an implausibly high number of stops or searches. We have an outstanding inquiry on this, but have not heard back. In particular, spikes in each week seem to usually be driven by a single officer with an implausibly high number of stops or searches (e.g., about 1,000 searches by an officer in a single week). Each spike seems to be driven by a different officer. Since this reporting seems highly unlikely, we exclude VA from search analyses.
  • Counties were mapped using the provided dictionary, which is included in the raw data folder.
  • There are no written warnings in Virginia and verbal warnings are not recorded, so all records are citations or searches without further action taken.
  • In the raw data, "Traffic arrests" refer to citations without a search. "Search arrests" refer to a citation and a search (either before or after the citation). "Search stops" refer to searches without a corresponding citation.
  • Additional columns in raw data that may be of interest: officer name.
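
The meaning of the three raw Virginia categories, combined with the fact that the original data was aggregated by week, can be sketched as below. This is a minimal Python illustration under assumptions: the function name is hypothetical, and expanding each weekly count into one row per stop is assumed rather than confirmed by the notes.

```python
from typing import Dict, List

# Flags implied by the raw category definitions in the note above.
RAW_CATEGORY_FLAGS: Dict[str, Dict[str, bool]] = {
    "Traffic arrests": {"citation_issued": True, "search_conducted": False},
    "Search arrests": {"citation_issued": True, "search_conducted": True},
    "Search stops": {"citation_issued": False, "search_conducted": True},
}


def expand_weekly_count(category: str, count: int) -> List[Dict[str, bool]]:
    """Turn one aggregated weekly count into `count` individual stop rows."""
    flags = RAW_CATEGORY_FLAGS[category]
    # Copy the flags so each row can be modified independently.
    return [dict(flags) for _ in range(count)]
```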

Burlington, VT

2012-01-01 to 2017-12-31

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 98.7%
lng 98.7%
subject_age 98.0%
subject_dob 97.9%
subject_race 96.6%
subject_sex 97.5%
department_name 100.0%
type 100.0%
violation 100.0%
arrest_made 99.2%
citation_issued 99.2%
warning_issued 99.2%
outcome 99.1%
contraband_found 100.0%
search_conducted 100.0%
search_basis 100.0%
reason_for_search 100.0%
reason_for_stop 98.7%
vehicle_registration_state 17.5%
raw_race 96.7%
raw_gender 97.5%
raw_contraband_evidence 98.6%
raw_outcome_of_stop 99.2%

Data notes:

  • Data is deduplicated on raw columns issued_at, location, race, gender, city, dob, lat, lon, reducing the number of records by ~7.0%
  • Calls are also provided in the raw data, but aren't loaded here
  • subject_race is based on the raw column race which is passed through as raw_race, and gender is passed through as raw_gender
  • reason_for_stop represents the raw column stop_based_on, and reason_for_search represents the raw column search_based_on and forms the basis for search_conducted and search_basis
  • When reason_for_search, i.e. search_based_on, is NA, we assume search_conducted is false
  • outcomes are based on raw column outcome_of_stop, which is passed through as raw_outcome_of_stop
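
As a rough illustration of the deduplication and the NA rule for searches described above, here is a minimal sketch. The key columns are the raw columns named in the notes; the sample values are placeholders, and treating every non-missing search_based_on value as a search is an assumption (the actual mapping may differ).

```python
DEDUP_KEYS = ("issued_at", "location", "race", "gender", "city", "dob", "lat", "lon")

def dedupe(rows, keys=DEDUP_KEYS):
    """Keep the first row for each distinct combination of the key columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row.get(k) for k in keys)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

def search_conducted(row):
    """A missing search_based_on means no search (the rule in the notes).

    Assumption: every non-missing value counts as a search.
    """
    return row.get("search_based_on") is not None

# Placeholder values, for illustration only.
stop = {
    "issued_at": "2015-03-01 10:00", "location": "placeholder street",
    "race": "x", "gender": "y", "city": "Burlington", "dob": "1980-01-01",
    "lat": 44.47, "lon": -73.21, "search_based_on": None,
}
rows = [stop, dict(stop), dict(stop, race="z", search_based_on="placeholder basis")]
clean = dedupe(rows)
share_removed = 1 - len(clean) / len(rows)
```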

Statewide, VT

2010-07-01 to 2015-12-31

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 92.6%
lng 92.6%
subject_age 99.6%
subject_race 98.6%
subject_sex 99.4%
officer_id 100.0%
officer_id_hash 100.0%
department_name 100.0%
type 100.0%
arrest_made 99.2%
citation_issued 99.2%
warning_issued 99.2%
outcome 99.2%
contraband_found 100.0%
search_conducted 100.0%
search_basis 100.0%
raw_stop_city 99.8%
raw_stop_reason_description 99.2%
raw_stop_search_description 99.2%
raw_stop_outcome_description 99.2%
raw_driver_gender 99.4%
raw_driver_race 98.6%

Data notes:

  • Stop purpose information is not very granular — there are only five categories, and we have no way of identifying speeding. See raw_stop_reason_description.
  • The search type field includes "Consent search — probable cause" and "Consent search — reasonable suspicion". It is not entirely clear what these mean; we cannot find analogues in other states.
  • Counties could be mapped by running the cities in the raw_stop_city field through Google's geocoder.
  • location is a simple concatenation of address, city, state, zip.
  • search_conducted was mapped from raw_stop_search_description.
  • contraband_found was mapped from raw_stop_contraband_description.
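
The location concatenation mentioned above might look like the following sketch. The function name, the ", " separator, and the choice to skip missing parts are assumptions rather than the project's documented behavior.

```python
def build_location(address, city, state, zip_code, sep=", "):
    """Join the non-missing address parts into one location string."""
    parts = [address, city, state, zip_code]
    return sep.join(str(p) for p in parts if p not in (None, ""))

# Placeholder address, for illustration only.
example = build_location("1 Placeholder Rd", "Montpelier", "VT", None)
```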

Statewide, WA

2009-01-01 to 2015-12-31

feature coverage rate
date 72.9%
time 100.0%
location 91.3%
lat 86.7%
lng 86.7%
county_name 86.7%
subject_age 71.6%
subject_race 71.8%
subject_sex 71.9%
officer_race 72.9%
officer_sex 100.0%
officer_first_name 100.0%
officer_last_name 100.0%
department_name 100.0%
type 100.0%
violation 41.2%
arrest_made 64.3%
citation_issued 100.0%
warning_issued 100.0%
outcome 70.9%
contraband_found 100.0%
frisk_performed 100.0%
search_conducted 100.0%
search_basis 100.0%
raw_officer_race 72.9%
raw_officer_gender 100.0%
raw_contact_type 100.0%
raw_driver_race 71.8%
raw_driver_gender 71.9%
raw_search_type 71.9%
raw_enforcements 70.9%

Data notes:

  • Counties were mapped by doing a reverse look-up of the geo lat/long coordinate of the highway post that was recorded for the stop, then mapping that latitude and longitude to a county using a shapefile. Details are in the WA_map_locations.R script.
  • Arrests and citations are grouped together in the stop_outcome, so we cannot reliably identify arrests. There is data on incident to arrest searches, but this does not necessarily identify all arrests.
  • If one were to dedupe on employee_last, employee_first, officer_race, officer_gender, contact_date, contact_hour, highway_type, road_number, milepost, driver_race, driver_age, driver_gender it would yield ~3.4% fewer rows. Without deduping, there are a few officers who seem to stop a suspiciously high, but not altogether unreasonable, number of people in an hour. However, we ultimately chose not to dedupe since most of the "duplicate" rows have NA for the driver demographics and other fields.
  • Weigh station stops were removed.
  • raw_enforcements is simply a concatenation of 12 enforcement columns in the raw data.
  • Additional columns in the raw data that may be of interest: officer name
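
One way to reproduce the kind of check behind the deduplication decision above is to count how many rows a dedupe would drop and how many of the duplicated rows lack driver demographics. This is a sketch under assumptions: the key columns are taken from the note, while the sample rows and the report layout are invented for illustration.

```python
from collections import Counter

KEYS = (
    "employee_last", "employee_first", "officer_race", "officer_gender",
    "contact_date", "contact_hour", "highway_type", "road_number", "milepost",
    "driver_race", "driver_age", "driver_gender",
)
DEMOGRAPHICS = ("driver_race", "driver_age", "driver_gender")

def row_key(row):
    return tuple(row.get(k) for k in KEYS)

def duplicate_report(rows):
    """Summarize what deduping on KEYS would remove."""
    counts = Counter(row_key(r) for r in rows)
    extra = sum(n - 1 for n in counts.values())  # rows a dedupe would drop
    dup_rows = [r for r in rows if counts[row_key(r)] > 1]
    missing = sum(all(r.get(k) is None for k in DEMOGRAPHICS) for r in dup_rows)
    return {
        "share_dropped": extra / len(rows) if rows else 0.0,
        "dup_rows": len(dup_rows),
        "dup_rows_missing_demographics": missing,
    }

# Made-up rows: `a` appears twice and has no driver demographics.
base = dict.fromkeys(KEYS)
a = dict(base, employee_last="DOE", contact_date="2012-05-01", contact_hour=14, milepost=10)
b = dict(a, milepost=11, driver_race="x", driver_age=30, driver_gender="y")
c = dict(b, milepost=12)
report = duplicate_report([a, dict(a), b, c])
```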

Tacoma, WA

2007-09-11 to 2017-09-10

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 82.5%
lng 82.5%
sector 79.7%
subsector 79.7%
officer_id 100.0%
officer_id_hash 100.0%
type 100.0%
disposition 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 64.5%

Data notes:

  • reason_for_stop is not recorded, and search/contraband information is not in their database, only in written reports; subject_race is also not recorded

Seattle, WA

2006-01-01 to 2015-12-31

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 91.8%
lng 91.8%
beat 90.7%
precinct 90.1%
sector 90.7%
subject_age 31.1%
subject_dob 31.0%
subject_race 0.1%
subject_sex 0.1%
officer_id 96.0%
officer_id_hash 96.0%
officer_first_name 85.3%
officer_last_name 96.0%
type 100.0%
disposition 100.0%
violation 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 58.5%
vehicle_color 4.4%
vehicle_make 3.4%
vehicle_model 3.3%
vehicle_registration_state 6.3%
vehicle_year 0.2%
raw_type_description 98.0%
raw_vehicle_description 6.3%

Data notes:

  • citation_issued includes criminal and non-criminal citations
  • violation represents raw column mir_description
  • type is based on violation (mir_description) and type_description, which is passed through as raw_type_description
  • outcomes are based on disposition, which represents raw column disposition_description
  • vehicle_* columns are based on a coalesced combination of veh and vehcile (sic) columns in the raw data; this is passed through as raw_vehicle_description
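
The coalescing of the two raw vehicle columns can be sketched as below. The precedence (veh before vehcile) and the treatment of blank strings as missing are assumptions; the notes only say the columns are coalesced.

```python
def coalesce(*values):
    """Return the first value that is neither None nor blank, else None."""
    for value in values:
        if value is not None and str(value).strip() != "":
            return value
    return None

def raw_vehicle_description(row):
    # Assumption: veh takes precedence over vehcile when both are present.
    return coalesce(row.get("veh"), row.get("vehcile"))
```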

Madison, WI

2007-09-28 to 2017-09-28

feature coverage rate
date 100.0%
time 100.0%
location 94.2%
lat 92.2%
lng 92.2%
district 85.5%
sector 85.5%
subject_race 98.8%
subject_sex 99.5%
officer_first_name 100.0%
officer_last_name 100.0%
type 100.0%
violation 99.4%
citation_issued 100.0%
warning_issued 100.0%
outcome 100.0%
speed 24.6%
posted_speed 24.6%
vehicle_color 93.8%
vehicle_make 97.3%
vehicle_model 24.8%
vehicle_registration_state 98.7%
vehicle_year 97.8%
raw_race 98.8%

Data notes:

  • Data is deduplicated on raw columns Date, Time, onStreet, onStreetName, OfficerName, Race, Sex, Make, Model, Year, State, Limit, and OverLimit, reducing the number of rows by ~0.7%
  • violation represents raw column Statute Description
  • Search/contraband information is missing
  • Data only includes warnings and citations, no arrests
  • If there was no Ticket #, this was assumed to be a warning
  • Shapefiles don't include district 2 and its accompanying sectors
  • subject_race is based on raw column Race, passed through as raw_race
  • 2007 has partial data and looks suspect; 2017 is missing October, November, and December
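
The warning rule above (no Ticket # means a warning) amounts to a one-line classification. In this sketch the function name and the "warning"/"citation" labels are illustrative, and treating a blank string as a missing ticket number is an assumption.

```python
def outcome_from_ticket(ticket_number):
    """Classify a Madison stop from its raw Ticket # value."""
    if ticket_number is None or str(ticket_number).strip() == "":
        return "warning"   # no ticket number: assumed to be a warning
    return "citation"      # the data contain only warnings and citations
```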

Statewide, WI

2010-01-01 to 2016-05-16

feature coverage rate
date 100.0%
time 100.0%
location 100.0%
lat 32.7%
lng 32.7%
county_name 100.0%
subject_race 85.5%
subject_sex 85.6%
officer_first_name 99.6%
officer_last_name 100.0%
department_id 100.0%
department_name 100.0%
type 100.0%
violation 100.0%
arrest_made 100.0%
citation_issued 100.0%
warning_issued 100.0%
outcome 99.9%
contraband_found 100.0%
contraband_drugs 91.4%
contraband_weapons 78.8%
contraband_alcohol 0.8%
contraband_other 1.2%
search_conducted 100.0%
search_person 100.0%
search_vehicle 100.0%
search_basis 99.9%
vehicle_color 86.9%
vehicle_make 87.0%
vehicle_model 74.6%
vehicle_type 87.3%
vehicle_registration_state 72.2%
vehicle_year 77.3%
raw_onHighwayDirection 87.6%
raw_onHighwayName 89.7%
raw_fromAtStreetName 75.9%
raw_race 85.5%
raw_sex 85.6%
raw_individualSearchConducted 85.7%
raw_vehicleSearchConducted 100.0%
raw_individualContraband 1.1%
raw_vehicleContraband 1.0%
raw_summaryOutcome 100.0%
raw_individualSearchBasis 1.1%
raw_vehicleSearchBasis 1.0%

Data notes:

  • The data come from two systems ("7.3" and "10.0") that succeeded each other. They have different field names and are differently coded. This is particularly relevant for the violation field, which has a different encoding between the two systems; in order to map violations, we used the dictionaries provided by the state for both systems.
  • There are two copies of the data: warnings and citations. Citations seem to be a strict subset of warnings, with some citation codes being different.
  • The police_department field is populated by highway patrol agencies. There are only 6 of them.
  • There are very few consent searches relative to other states, suggesting a potential difference in recording policy.
  • raw_[individual/vehicle]Contraband were mapped using a data dictionary provided by the department: 01 = WEAPON(S); 02 = EXCESSIVE CASH; 03 = ILLICIT DRUG(S)/PARAPHERNALIA; 04 = EVIDENCE OF A CRIME; 05 = INTOXICANT(S); 06 = STOLEN GOODS; 99 = OTHER; 00 = NONE
  • raw_[individual/vehicle]SearchBasis were mapped using a data dictionary. There is no code for "plain view". 1 = Consent; 2 = Probable Cause; the rest of the search basis categories are Warrant, Incident to Arrest, Inventory, and Exigent Circumstances
  • violation was mapped directly from StatuteDescription in the raw data.
  • location is a concatenation of raw_onHighwayDirection, raw_onHighwayName, raw_fromAtStreetName, and county_name
  • There are about 150 columns in the raw data (many columns about road type and conditions, many about vehicle details, etc.); however, the vast majority of the columns are 95-100% empty.
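
To make the contraband dictionary above concrete, the sketch below decodes a single raw code into the standardized contraband_* flags. The code-to-label table is copied from the note; the grouping of codes into drugs, weapons, alcohol, and other, and the assumption that each raw value holds exactly one code, are illustrative guesses rather than the documented mapping.

```python
CONTRABAND_CODES = {
    "01": "WEAPON(S)",
    "02": "EXCESSIVE CASH",
    "03": "ILLICIT DRUG(S)/PARAPHERNALIA",
    "04": "EVIDENCE OF A CRIME",
    "05": "INTOXICANT(S)",
    "06": "STOLEN GOODS",
    "99": "OTHER",
    "00": "NONE",
}

# Assumed grouping; every other non-"00" code falls under contraband_other.
FLAG_FOR_CODE = {
    "01": "contraband_weapons",
    "03": "contraband_drugs",
    "05": "contraband_alcohol",
}

def contraband_flags(code):
    """Turn one raw contraband code into boolean flags (None if missing)."""
    if code is None:
        return None
    code = str(code).zfill(2)  # tolerate codes that lost a leading zero
    if code not in CONTRABAND_CODES:
        raise ValueError(f"unknown contraband code: {code}")
    flags = dict.fromkeys(
        ("contraband_drugs", "contraband_weapons",
         "contraband_alcohol", "contraband_other"),
        False,
    )
    if code != "00":
        flags[FLAG_FOR_CODE.get(code, "contraband_other")] = True
    flags["contraband_found"] = any(flags.values())
    return flags
```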

Statewide, WY

2011-01-01 to 2012-12-31

feature coverage rate
date 99.8%
time 100.0%
location 100.0%
county_name 98.7%
subject_age 99.6%
subject_race 99.7%
subject_sex 99.3%
officer_id 100.0%
officer_id_hash 100.0%
department_id 93.8%
type 100.0%
outcome 100.0%
raw_race 99.7%
raw_sex 99.7%
raw_streetnbr 97.0%
raw_street 99.5%

Data notes:

  • Only citations are included in the data.
  • The department_name field is populated by the state trooper division.
  • The violation field is populated by violated statute codes.
  • Rows represent citations, not stops, so we remove duplicates by grouping by the other fields.
  • contraband_found could potentially be derived from violation codes (drug/alcohol/weapons), but it would be less reliable and not necessarily comparable to how we defined contraband_found for other states, so we do not.
  • department_id was mapped directly from emdivision in the raw data.
  • violation was mapped directly from charge in the raw data.
  • location is a concatenation of raw_streetnbr, raw_street, and city (and note that city is actually county, and is mapped to county_name with light standardization).
  • Additional columns in raw data that may be of interest: statute, is_acciden.
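
Collapsing citation rows into stops, as described above, can be sketched by grouping on every field except the charge. The "|" separator, the function name, and the sample rows are assumptions for illustration; this is not the project's exact grouping logic.

```python
def collapse_citations(rows, charge_field="charge", sep="|"):
    """Group citation rows that agree on all other fields into one stop."""
    groups = {}
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k != charge_field))
        groups.setdefault(key, []).append(row[charge_field])
    stops = []
    for key, charges in groups.items():
        stop = dict(key)
        # dict.fromkeys drops repeated charges while keeping first-seen order.
        stop["violation"] = sep.join(dict.fromkeys(charges))
        stops.append(stop)
    return stops

# Made-up citations: the first two belong to the same stop.
citations = [
    {"date": "2011-06-01", "officer_id": "7", "charge": "A"},
    {"date": "2011-06-01", "officer_id": "7", "charge": "B"},
    {"date": "2011-06-02", "officer_id": "7", "charge": "A"},
]
stops = collapse_citations(citations)
```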

CHANGE LOG FOR NEXT UPDATE:

  • More stringent deduping logic
  • Contraband found set to FALSE when NA and search conducted is true
  • Predication correction added to metadata
  • Six more cities
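
The planned contraband rule in the list above could be expressed as follows. This is only a sketch of the stated rule; the function name is hypothetical, and None stands in for NA.

```python
def fill_contraband_found(search_conducted, contraband_found):
    """Set contraband_found to False when it is missing and a search occurred."""
    if contraband_found is None and search_conducted is True:
        return False
    return contraband_found  # all other combinations are left unchanged
```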