# Clean and reformat CDR data from its multi-tab excel file into a single csv

### About the data

CDR data is tricky -- the form used by law enforcement has changed over time, first in 2005, then again in 2016. The data before 2005 is known to be be sparse and poorly enforced, so we ignore those entries. The 2005 and 2016 versions of the form have some overlap and some differences, so we must be careful in how we merge them.

**In this repo you can find blank versions of the [2005](https://github.com/texas-justice-initiative/data-processing/blob/master/forms/CDR%20Form%20Version%202005.pdf) and [2016](https://github.com/texas-justice-initiative/data-processing/blob/master/forms/CDR%20Form%20Version%202016.pdf) forms, to see for yourself exactly what fields are collected and how.**

### Datasets used


* Input:
  * `tji/deaths-in-custody/CDR - All Reports.xlsx`
  * `tji/auxiliary-datasets/agencies_and_counties`
* Output:
  * `tji/deaths-in-custody/cleaned_custodial_death_reports.csv`
  
##### Author: Everett Wetchler (everett.wetchler@gmail.com)

## Setup and read data

In [1]:
DTW_PROJECT_KEY_CDR = 'tji/deaths-in-custody'
RAW_FILENAME = 'original/CDR Reports All.xlsx'
CLEANED_FILENAME = 'cleaned_custodial_death_reports.csv'

In [2]:
import datetime

import datadotworld as dw
import json
import numpy as np
import pandas as pd

from lib.cleaning_tools import *

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

%load_ext watermark
%watermark -a "Everett Wetchler" -d -t -z -w -p numpy,pandas,datadotworld

Everett Wetchler 2018-07-16 10:34:01 CDT

numpy 1.14.5
pandas 0.23.3
datadotworld 1.6.0
watermark 1.6.1


In [3]:
from lib.standardize_police_agency_names import standardize_agency_name

In [4]:
datasets = dw.load_dataset('tji/auxiliary-datasets', force_update=True)
agency_county = datasets.dataframes['agencies_and_counties']
agency_county = agency_county.set_index('agency')['county'].to_dict()

In [5]:
dfs = read_dtw_excel(DTW_PROJECT_KEY_CDR, RAW_FILENAME)
dfs['Form Version 2005']['form_version'] = 'V_2005'
dfs['Form Version 2016']['form_version'] = 'V_2016'
dfs['Older Forms']['form_version'] = 'V_OLDER'
cdr = pd.concat([dfs['Form Version 2005'], dfs['Form Version 2016']])
print('Read %d raw CDR records. Ignoring %d from older form versions (pre-2005).' % (
    sum(len(x) for x in dfs), len(dfs['Older Forms'])))
print('Keeping %d records using form version 2005, and %d using version 2016 (keeping %d in total)' % (
    len(dfs['Form Version 2005']), len(dfs['Form Version 2016']), len(cdr)))

Writing excel file to temp file: /var/folders/dc/8cbxbsh515s908xl0zyprszm0000gn/T/tmph8r1wc82
Read 68 raw CDR records. Ignoring 5468 from older form versions (pre-2005).
Keeping 5156 records using form version 2005, and 1258 using version 2016 (keeping 6414 in total)


of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  """


### In theory, all these records should be for deaths in 2005 or later. Let's double check and drop any miscreants.

In [6]:
assert cdr['Death Date and Time'].isnull().sum() == 0

In [7]:
before = len(cdr)
cdr = cdr[cdr['Death Date and Time'].dt.year >= 2005]
after = len(cdr)
print('Dropped %d (of %d) reports for deaths before 2005, leaving %d' % (before - after, before, after))

Dropped 6 (of 6414) reports for deaths before 2005, leaving 6408


In [8]:
print("Data goes from %s to %s" % (cdr['Death Date and Time'].min().strftime("%Y-%m-%d"),
                                   cdr['Death Date and Time'].max().strftime("%Y-%m-%d")))

Data goes from 2005-01-01 to 2018-07-08


### A quick look at the raw data

In [9]:
dfs['Form Version 2005'].head()

Unnamed: 0,CDR: CDR Name,Report Date,Status,Version Type,Version Number,Department Type,Agency Name,Agency Phone,Agency Address,Agency City,Agency County,Agency State,Agency Zip,Director Salutation,Director First Name,Director Middle Name,Director Last Name,Reporter Name Original CDR,Reporter Email,Street Address,City,County,Longitude,Latitude,Census Tract,Type of Custody,Specific Type of Custody/Facility,Custody Type Facility,Entry Date Time,Entry Date Time N/A,Death Location,Death Location Elsewhere,First Name,Middle Name,Last Name,Suffix,Date of Birth,Sex,Ethnicity,Ethnicity Other,Death Date and Time,Age At Time Of Death,Date/Time of Custody or Incident,Custody Date NA,Medical Examinor/Coroner Evalution?,Manner of Death,Manner of Death Description,Death Reason,Medical Cause of Death,Medical Treatment,Medical Treatment Description,Who caused the death?,Death Causer Other,Type of Death Weapon,Death Weapon Other Description,Pre existing medical condition?,Means of Death,Means of Death Other,Offense 1,Offense 2,Offense 3,Were the Charges:,Type of Offense,"Type of Offense, Other",Injured By,Threaten the officer(s) involved,Resist being handcuffed or arrested?,Try to escape/flee from custody,"Grab, hit or fight with the officer(s)",Other Behavior,Specify Other Behavior,Appear intoxicated (alcohol or drugs),Use weapon threaten/assault officer(s),Entry Behavior,Under Restraint,Type of Restraint,"Other device, specify",form_version
0,16-89-C,2017-03-13 16:47:00,Submitted,ORIGINAL VERSION,1,,Seguin Police Department,830-379-2123,350 N. Guadalupe Street,Seguin,,TX,78155,Chief,Kevin,,Kelso,MC Meyers,mcmeyers@seguintexas.gov,US 90 at FM 1104,Kingsbury,Guadalupe,,,,Police Custody (pre-booking),Custody of Law Enforcement Personnel during/fl...,,NaT,1,At the crime/arrest scene,,Darrel,S.,Machemehl,,1972-01-20,Male,Anglo,,2016-04-26 23:00:00,44,2016-04-26 23:00:00,1,"Yes, results are available",Other - specify,Drew his weapon on officers and was shot by of...,,Multiple center fire rifle wounds,Not Applicable,,Law enforcement/correctional staff,,Rifle/Shotgun,,Not Applicable; cause of death was accidental ...,Firearm,,Aggravated Robbery,Evading Arrest with Vehicle,,Not filed at time of death,Violent Crime Against Persons; Serious Crime A...,,,Yes,No,Yes,No,1,Was seen putting a shotgun in his backpack the...,,Yes,,No,,,V_2005
1,16-85-C,2017-02-15 14:41:00,Submitted,ORIGINAL VERSION,1,Police,Odessa Police Dept.,432-335-3327,205 N. Grant,Odessa,,TX,79760,Chief,Timothy,,Burton,Sgt. David Lara,dlara@odessa-tx.gov,1209 Douglas,Odessa,Ector,,,,Police Custody (pre-booking),Custody of Law Enforcement Personnel during/fl...,,NaT,1,At the crime/arrest scene,,Jason,Wade,Meador,,1965-07-18,Male,Anglo,,2016-03-10 17:00:00,50,2016-03-10 16:25:00,0,"Yes, results pending",Other - specify,Peace Officer's use of force,Not applicable,Gun shot wound to the chest,Not Applicable,,Law enforcement/correctional staff,,Rifle/Shotgun,,Not Applicable; cause of death was accidental ...,Firearm,,Emergency Detention Order,,,Not filed at time of death,"Other, specify",Emergency Detention Order,Injured by Officer,Yes,,,,1,Subject was pouring gas on himself,Yes,Yes,,,,,V_2005
2,15-3-C,2016-12-29 14:24:00,Submitted,ORIGINAL VERSION,1,Sheriff,Wise County Sheriff's Office,940-627-5971,200 Rook Ramsey,Decatur,Wise,TX,76234,Sheriff,Lane,,Akin,N. Lane Akin,lane.akin@sheriff.co.wise.tx.us,Wooded area adjacent to Picket Run Road,Unincorporated,Montague,,,,Police Custody (pre-booking),Custody of Law Enforcement Personnel during/fl...,,NaT,1,At the crime/arrest scene,,Alan,Lance,Alverson,,1969-02-25,Male,Anglo,,2015-01-28 15:10:00,45,2015-01-28 15:10:00,0,"Yes, results are available",Homicide (includes Justifiable Homicide),,Injuries only,Multiple gunshot wounds,Not Applicable,,Law enforcement/correctional staff,,Handgun; Rifle/Shotgun,,Not Applicable; cause of death was accidental ...,Firearm,,"Attempted Capital Murder, Multiple Counts",Possession of a Controlled Substance,,Not filed at time of death,Violent Crime Against Persons,,Injured by Officer,Yes,,Yes,,0,,,Yes,,No,,,V_2005
3,16-34-C,2016-12-29 09:29:00,Submitted,ORIGINAL VERSION,1,Constable,Bexar County Constable's Office Precinct 3,210-335-4750,"8918 Tesoro Drive, Suite 301",San Antonio,Bexar,TX,78217,Constable,Mark,,Volvodich,Investigator Robert Hardcastle #1317,robert.hardcastle@bexar.org,Blanco Road South at Intersection of Huebner Road,San Antonio,Bexar,,,,Police Custody (pre-booking),Custody of Law Enforcement Personnel during/fl...,,NaT,1,"Elsewhere, specify",Roadway / Intersection,Rene,,Burrola,,1972-09-03,Male,Hispanic,,2016-10-28 21:23:00,44,2016-10-28 21:23:00,0,"Yes, results pending",Accidental,,Injuries only,Multiple Blunt force Injuries,Not Applicable,,Deceased,,,,Not Applicable; cause of death was accidental ...,Vehicle accident,,Evading Motor Vehicle,Recovered Stolen Motorcycle / Theft,,Not filed at time of death,"Other, specify",Evading Arrest Motor Vehicle,Injured by Self Accident,,,Yes,,0,,,No,,No,,,V_2005
4,15-2-C,2016-12-29 07:46:00,Submitted,ORIGINAL VERSION,1,Sheriff,Ellis County Sheriff's office,972-825-4901,300 S. Jackson St.,Waxahachie,Ellis,TX,75165,Sheriff,Johnny,,Brown,Lt. Shane Thompson,shane.thompson@co.ellis.tx.us,SB IH-35 E MM 389,Italy,Ellis,,,,Police Custody (pre-booking),Custody of Law Enforcement Personnel subsequen...,,NaT,1,At the crime/arrest scene,,Miguel,Alberto,Martinez,,1996-11-27,Male,Hispanic,,2015-06-03 05:45:00,18,2015-06-02 20:34:00,0,"Yes, results are available",Homicide (includes Justifiable Homicide),,Injuries only,Gunshot wounds,Not Applicable,,Law enforcement/correctional staff,,Handgun; Rifle/Shotgun,,Not Applicable; cause of death was accidental ...,Firearm,,Aggravated Assault with Deadly Weapon,Attempted Capital Murder,,Not filed at time of death,Violent Crime Against Persons,,Injured by Officer,Yes,,,,0,,,Yes,,"Yes, mark which restraint devices were used","Handcuffs; Other device, specify",In front,V_2005


In [10]:
dfs['Form Version 2016'].head()

Unnamed: 0,CDR: CDR Name,Version Type,Version Number,Report Date,Status,Agency Name,Agency Address,Agency City,Agency State,Agency Zip,Director Salutation,Director First Name,Director Middle Name,Director Last Name,Reporter Name,Reporter Email,First Name,Middle Name,Last Name,Suffix,Date of Birth,Sex,Race,Age At Time Of Death,Date/Time of Custody or Incident,Death Date and Time,Medical Examinor/Coroner Evalution?,Manner of Death,Manner of Death Description,Medical Cause of Death,Medical Treatment,Who caused the death?,Type of weapon that caused death?,"Other weapon, specify",Pre existing medical condition?,Means of Death,Means of Death Other,Street Address,City,County,Zip,Longitude,Latitude,Census Tract,Location Category,Other Location Category,Type of Custody,Specific Type of Custody/Facility,TDCJ - Specify Unit,Entry Date Time,Entry Date Time N/A,Death Location,Death Location Elsewhere,Other Agencies Respond?,Offense 1,Offense 2,Offense 3,Were the Charges:,Type of Offense,"Type of Offense, Other",Decedent display/use of weapons,Decedent Display or Use Weapon Details,Specify Weapon Used,Attempt to Injure Others?,Appear intoxicated (alcohol or drugs),Make suicidal statements?,Exhibit any mental health problems?,Exhibit any medical problems?,Barricade self or initiate standoff?,Resist being handcuffed or arrested?,Physically attempt/assault officer(s),Gain possession of officer's weapon,Verbally threaten other(s) including law,Escape or attempt to escape/flee custody,Attempt gain possession officer's weapon,Under Restraint,Type of Restraint,"Other device, specify",form_version
0,18-599-CJ,ORIGINAL VERSION,1,2018-07-09 11:05:00,Submitted,Travis County Sheriff's Dept.,5555 Airport Blvd.,Austin,TX,78751,Sheriff,Sally,,Hernandez,Inocencio Flores,inocencio.flores@traviscountytx.gov,Ronald,Lynn,Hall,,1962-08-01,Male,Anglo or White,55,2017-04-26 00:27:00,2018-07-08 10:19:00,"Yes, results pending",Pending autopsy results,,Pending autopsy results,Unknown,Not applicable,Not Applicable,,Could not be determined,Unknown,,3614 Bill Price Road,Del Valle,Travis,78617.0,,,,Law Enforcement Facility,,County Jail,Jail - single cell,,7/7/18 17:33,0,Medical facility,,No,Driving While Intoxicated - 3rd or more,,,A probation/parole violation,Alcohol / drug offense,,No,,,No,Unknown,No,Yes,Unknown,No,No,No,No,No,No,No,No,,,V_2016
1,18-597-P,ORIGINAL VERSION,1,2018-07-06 10:39:00,Submitted,TDCJ/Office of the Inspector General,1012 Veterans Memorial Parkway,Huntsville,TX,77320,Director,Bruce,,Toney,Ashley Gutierrez,ashley.gutierrez@tdcj.texas.gov,Chris,,Killgo,,1954-05-12,Male,Anglo or White,64,2008-01-11 00:00:00,2018-07-04 14:20:00,"No, evaluation not planned",Natural,,Locked in Syndrome,Unknown,Not applicable,Not Applicable,,Could not be determined,"Not applicable, cause of death was illness/nat...",,12071 FM 3522,Abilene,Jones,79601.0,,,,Law Enforcement Facility,,Penitentiary,"TDCJ, specify",Robertson Unit,1/11/08 0:00,0,Medical facility,,No,Aggravated Sexual Assault of a Child,,,Convicted,Crimes Against Child(ren),,No,,,No,No,No,No,Yes,No,No,No,No,No,No,No,No,,,V_2016
2,18-595-P,ORIGINAL VERSION,1,2018-07-06 10:33:00,Submitted,TDCJ/Office of the Inspector General,1012 Veterans Memorial Parkway,Huntsville,TX,77320,Director,Bruce,,Toney,Ashley Gutierrez,ashley.gutierrez@tdcj.texas.gov,Darryl,,Wright,,1966-07-15,Male,Black or African American,51,2015-09-03 00:00:00,2018-07-02 04:54:00,"Yes, results pending",Pending autopsy results,,"Hepatitis C, Liver Disease",Yes,Not applicable,Not Applicable,,Could not be determined,"Not applicable, cause of death was illness/nat...",,9601 Spur 591,Amarillo,Potter,79107.0,,,,Law Enforcement Facility,,Penitentiary,"TDCJ, specify",Bill Clements Unit,9/3/15 0:00,0,Law enforcement facility/booking center,,No,Possession of a Controlled Substance,,,Convicted,Alcohol / drug offense,,Unknown,,,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,No,,,V_2016
3,18-593-P,ORIGINAL VERSION,1,2018-07-06 10:27:00,Submitted,TDCJ/Office of the Inspector General,1012 Veterans Memorial Parkway,Huntsville,TX,77320,Director,Bruce,,Toney,Ashley Gutierrez,ashley.gutierrez@tdcj.texas.gov,Craig,,Anderson,,1960-02-27,Male,Black or African American,58,1990-07-19 00:00:00,2018-06-30 15:44:00,"Yes, results pending",Homicide (includes Justifiable Homicide),,Blunt Force trauma to the back of the head,Not Applicable,Other detainee(s),Not Applicable,,Not Applicable; cause of death was accidental ...,"Other, specify",Physical blow caused decedent to fall and stri...,1200 FM 655,Rosharon,Brazoria,77583.0,,,,Law Enforcement Facility,,Penitentiary,"TDCJ, specify",Stringfellow Unit,7/19/90 0:00,0,Medical facility,,No,Robbery,,,Convicted,Violent Crime Against Persons,,Unknown,,,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,No,,,V_2016
4,18-591-P,ORIGINAL VERSION,1,2018-07-06 10:20:00,Submitted,TDCJ/Office of the Inspector General,1012 Veterans Memorial Parkway,Huntsville,TX,77320,Director,Bruce,,Toney,Ashley Gutierrez,ashley.gutierrez@tdcj.texas.gov,Thomas,,Snyder,,1957-12-27,Male,Anglo or White,60,2018-02-23 00:00:00,2018-06-26 07:48:00,"No, evaluation not planned",Natural,,Liver and Lung Cancer,Unknown,Not applicable,Not Applicable,,Could not be determined,"Not applicable, cause of death was illness/nat...",,2664 FM 2054,Tennessee Colony,Anderson,75886.0,,,,Law Enforcement Facility,,Penitentiary,"TDCJ, specify",Michael Unit,2/23/18 0:00,0,Medical facility,,No,Driving while Intoxicated,,,Convicted,Alcohol / drug offense,,Unknown,,,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,No,,,V_2016


## Helper methods

In [11]:
def show_notnull_chart(cdr, vertical=False):
    '''Show which columns have null values, how often, and break down by form version.'''
    versions = sorted(set(cdr.form_version))
    frames = []
    for v in versions:
        fr = cdr[cdr.form_version == v]
        s = fr.notnull().mean()
        frames.append(s)
    frame = pd.concat(frames, axis=1)
    frame.columns = versions
    frame.sort_index(inplace=True)
    if not vertical:
        frame = frame.T
    return frame.style.background_gradient(cmap='RdYlGn', axis=(0 if vertical else 1))

In [12]:
OTHER_SPECIFY = 'OTHER, SPECIFY'

-----
## Begin Cleaning
-----

### First, take a look at how often our columns are populated, broken down by form version (since each ask slightly different questions)

In [13]:
show_notnull_chart(cdr)

Unnamed: 0,Age At Time Of Death,Agency Address,Agency City,Agency County,Agency Name,Agency Phone,Agency State,Agency Zip,Appear intoxicated (alcohol or drugs),Attempt gain possession officer's weapon,Attempt to Injure Others?,Barricade self or initiate standoff?,CDR: CDR Name,Census Tract,City,County,Custody Date NA,Custody Type Facility,Date of Birth,Date/Time of Custody or Incident,Death Causer Other,Death Date and Time,Death Location,Death Location Elsewhere,Death Reason,Death Weapon Other Description,Decedent Display or Use Weapon Details,Decedent display/use of weapons,Department Type,Director First Name,Director Last Name,Director Middle Name,Director Salutation,Entry Behavior,Entry Date Time,Entry Date Time N/A,Escape or attempt to escape/flee custody,Ethnicity,Ethnicity Other,Exhibit any medical problems?,Exhibit any mental health problems?,First Name,Gain possession of officer's weapon,"Grab, hit or fight with the officer(s)",Injured By,Last Name,Latitude,Location Category,Longitude,Make suicidal statements?,Manner of Death,Manner of Death Description,Means of Death,Means of Death Other,Medical Cause of Death,Medical Examinor/Coroner Evalution?,Medical Treatment,Medical Treatment Description,Middle Name,Offense 1,Offense 2,Offense 3,Other Agencies Respond?,Other Behavior,Other Location Category,"Other device, specify","Other weapon, specify",Physically attempt/assault officer(s),Pre existing medical condition?,Race,Report Date,Reporter Email,Reporter Name,Reporter Name Original CDR,Resist being handcuffed or arrested?,Sex,Specific Type of Custody/Facility,Specify Other Behavior,Specify Weapon Used,Status,Street Address,Suffix,TDCJ - Specify Unit,Threaten the officer(s) involved,Try to escape/flee from custody,Type of Custody,Type of Death Weapon,Type of Offense,"Type of Offense, Other",Type of Restraint,Type of weapon that caused death?,Under Restraint,Use weapon threaten/assault officer(s),Verbally threaten other(s) including law,Version Number,Version Type,Were the Charges:,Who caused the death?,Zip,form_version
V_2005,1,1,1,0.999612,1,1,0.993398,1,0.998447,0.0,0,0.0,1,0,0.998835,1,1,0.512427,1,1,0.00912621,1,1,0.0192233,0.999806,0.0333981,0.0,0,0.999806,1,1,0.202524,1,0.141748,0.693592,1,0.0,1,0.00427184,0.0,0.0,1,0.0,0.998641,0.999806,1,0,0,0,0.0,1,0.589126,1,0.0460194,0.999417,1,0.999806,0.279223,0.559806,0.998835,0.261553,0.0819417,0,1,0.0,0.0159223,0.0,0.0,1,0,1.0,1.0,0,0.999806,0.998641,1,1.0,0.0341748,0.0,1,0.999806,0.0706796,0.0,0.999417,0.999029,1,0.999417,0.00194175,0.149126,0.00038835,0.0,0.999806,1,0.0,1,1,1,1,0.0,1
V_2016,1,1,1,0.0,1,0,1.0,1,0.888712,0.685215,1,0.682035,1,0,0.999205,1,0,0.0,1,1,0.0,1,1,0.018283,0.0,0.0,0.163752,1,0.0,1,1,0.0166932,1,0.0,0.769475,1,0.682035,0,0.0,0.892687,0.887917,1,0.68442,0.0,0.0,1,0,1,0,0.887917,1,0.0119237,1,0.0286169,0.99682,1,1.0,0.0,0.36566,0.994436,0.267886,0.100954,1,0,0.0405405,0.0063593,0.0310016,0.686804,1,1,0.999205,0.999205,1,0.0,0.672496,1,0.949126,0.0,0.0166932,1,1.0,0.04531,0.655803,0.0,0.0,1,0.0,0.999205,0.121622,0.0612083,0.999205,1.0,0,0.686804,1,1,1,1,0.999205,1


### Start by dropping columns that we either don't need or haven't written code to clean yet.

In [14]:
keep_text = '''Both forms

- Age At Time Of Death
- Agency Address
- Agency City
- Agency Name
- Agency Zip
- CDR: CDR Name
- Census Tract
- City
- County
- Date of Birth
- Date/Time of Custody or Incident
- Death Date and Time
- Death Location
- Death Location Elsewhere
- Entry Date Time
- Entry Date Time N/A
- First Name
- Middle Name
- Last Name
- Latitude
- Longitude
- Suffix
- Manner of Death
- Manner of Death Description
- Means of Death
- Means of Death Other
- Medical Cause of Death
- Medical Examinor/Coroner Evalution?
- Medical Treatment
- Offense 1
- Offense 2
- Offense 3
- Pre existing medical condition?
- Report Date
- Sex
- Specific Type of Custody/Facility
- Street Address
- Type of Custody
- Type of Offense
- Type of Offense, Other
- Version Number
- Version Type
- Were the Charges:
- Who caused the death?
- form_version

2005 form only

- Agency County
- Custody Date NA
- Death Causer Other
- Department Type
- Entry Behavior
- Ethnicity
- Ethnicity Other
- Other Behavior
- Specify Other Behavior

2016 form only

- Exhibit any medical problems?
- Exhibit any mental health problems?
- Make suicidal statements?
- Race'''
keep_cols = []
drop_cols = set(cdr.columns)
for line in keep_text.splitlines():
    if line.startswith('- '):
        colname = line[2:]
        keep_cols.append(colname)
        drop_cols.remove(colname)

### Rename columns to be more machine friendly (lowercase, snake_case, and remove non-alphanumeric characters)

In [15]:
col_renames = {}
for c in keep_cols:
    new_name = ''.join([ch if ch.isalnum() else ' ' for ch in c.lower()])
    new_name = '_'.join(new_name.strip().split())
    col_renames[c] = new_name

In [16]:
cdr = cdr[list(col_renames.keys())]
cdr.rename(columns=col_renames, inplace=True)
cdr.head()

Unnamed: 0,age_at_time_of_death,agency_address,agency_city,agency_name,agency_zip,cdr_cdr_name,census_tract,city,county,date_of_birth,date_time_of_custody_or_incident,death_date_and_time,death_location,death_location_elsewhere,entry_date_time,entry_date_time_n_a,first_name,middle_name,last_name,latitude,longitude,suffix,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,offense_1,offense_2,offense_3,pre_existing_medical_condition,report_date,sex,specific_type_of_custody_facility,street_address,type_of_custody,type_of_offense,type_of_offense_other,version_number,version_type,were_the_charges,who_caused_the_death,form_version,agency_county,custody_date_na,death_causer_other,department_type,entry_behavior,ethnicity,ethnicity_other,other_behavior,specify_other_behavior,exhibit_any_medical_problems,exhibit_any_mental_health_problems,make_suicidal_statements,race
0,44,350 N. Guadalupe Street,Seguin,Seguin Police Department,78155,16-89-C,,Kingsbury,Guadalupe,1972-01-20,2016-04-26 23:00:00,2016-04-26 23:00:00,At the crime/arrest scene,,NaT,1,Darrel,S.,Machemehl,,,,Other - specify,Drew his weapon on officers and was shot by of...,Firearm,,Multiple center fire rifle wounds,"Yes, results are available",Not Applicable,Aggravated Robbery,Evading Arrest with Vehicle,,Not Applicable; cause of death was accidental ...,2017-03-13 16:47:00,Male,Custody of Law Enforcement Personnel during/fl...,US 90 at FM 1104,Police Custody (pre-booking),Violent Crime Against Persons; Serious Crime A...,,1,ORIGINAL VERSION,Not filed at time of death,Law enforcement/correctional staff,V_2005,,1.0,,,,Anglo,,1.0,Was seen putting a shotgun in his backpack the...,,,,
1,50,205 N. Grant,Odessa,Odessa Police Dept.,79760,16-85-C,,Odessa,Ector,1965-07-18,2016-03-10 16:25:00,2016-03-10 17:00:00,At the crime/arrest scene,,NaT,1,Jason,Wade,Meador,,,,Other - specify,Peace Officer's use of force,Firearm,,Gun shot wound to the chest,"Yes, results pending",Not Applicable,Emergency Detention Order,,,Not Applicable; cause of death was accidental ...,2017-02-15 14:41:00,Male,Custody of Law Enforcement Personnel during/fl...,1209 Douglas,Police Custody (pre-booking),"Other, specify",Emergency Detention Order,1,ORIGINAL VERSION,Not filed at time of death,Law enforcement/correctional staff,V_2005,,0.0,,Police,,Anglo,,1.0,Subject was pouring gas on himself,,,,
2,45,200 Rook Ramsey,Decatur,Wise County Sheriff's Office,76234,15-3-C,,Unincorporated,Montague,1969-02-25,2015-01-28 15:10:00,2015-01-28 15:10:00,At the crime/arrest scene,,NaT,1,Alan,Lance,Alverson,,,,Homicide (includes Justifiable Homicide),,Firearm,,Multiple gunshot wounds,"Yes, results are available",Not Applicable,"Attempted Capital Murder, Multiple Counts",Possession of a Controlled Substance,,Not Applicable; cause of death was accidental ...,2016-12-29 14:24:00,Male,Custody of Law Enforcement Personnel during/fl...,Wooded area adjacent to Picket Run Road,Police Custody (pre-booking),Violent Crime Against Persons,,1,ORIGINAL VERSION,Not filed at time of death,Law enforcement/correctional staff,V_2005,Wise,0.0,,Sheriff,,Anglo,,0.0,,,,,
3,44,"8918 Tesoro Drive, Suite 301",San Antonio,Bexar County Constable's Office Precinct 3,78217,16-34-C,,San Antonio,Bexar,1972-09-03,2016-10-28 21:23:00,2016-10-28 21:23:00,"Elsewhere, specify",Roadway / Intersection,NaT,1,Rene,,Burrola,,,,Accidental,,Vehicle accident,,Multiple Blunt force Injuries,"Yes, results pending",Not Applicable,Evading Motor Vehicle,Recovered Stolen Motorcycle / Theft,,Not Applicable; cause of death was accidental ...,2016-12-29 09:29:00,Male,Custody of Law Enforcement Personnel during/fl...,Blanco Road South at Intersection of Huebner Road,Police Custody (pre-booking),"Other, specify",Evading Arrest Motor Vehicle,1,ORIGINAL VERSION,Not filed at time of death,Deceased,V_2005,Bexar,0.0,,Constable,,Hispanic,,0.0,,,,,
4,18,300 S. Jackson St.,Waxahachie,Ellis County Sheriff's office,75165,15-2-C,,Italy,Ellis,1996-11-27,2015-06-02 20:34:00,2015-06-03 05:45:00,At the crime/arrest scene,,NaT,1,Miguel,Alberto,Martinez,,,,Homicide (includes Justifiable Homicide),,Firearm,,Gunshot wounds,"Yes, results are available",Not Applicable,Aggravated Assault with Deadly Weapon,Attempted Capital Murder,,Not Applicable; cause of death was accidental ...,2016-12-29 07:46:00,Male,Custody of Law Enforcement Personnel subsequen...,SB IH-35 E MM 389,Police Custody (pre-booking),Violent Crime Against Persons,,1,ORIGINAL VERSION,Not filed at time of death,Law enforcement/correctional staff,V_2005,Ellis,0.0,,Sheriff,,Hispanic,,0.0,,,,,


## Add rows from the old master file to get BJS inpatient deaths (will dedup after)

In [17]:
datasets = dw.load_dataset('tji/raw-and-processing', force_update=True)
old_master = datasets.dataframes['reformatted_cdr_2017_master_file']
old_master['form_version'] = 'V_BJS'
print(old_master.shape)
old_master.head()

(7735, 47)


Unnamed: 0,first_name,middle_name,last_name,suffix,age_at_time_of_death,sex,date_of_birth,offense_1,offense_2,offense_3,report_date,death_date_and_time,cdr_cdr_name,agency_name,agency_address,agency_city,agency_zip,agency_county,who_caused_the_death,death_causer_other,date_time_of_custody_or_incident,custody_date_na,type_of_custody,specific_type_of_custody_facility,street_address,city,county,death_location,death_location_elsewhere,means_of_death,means_of_death_other,manner_of_death,manner_of_death_description,entry_date_time,entry_date_time_n_a,medical_cause_of_death,pre_existing_medical_condition,medical_examinor_coroner_evalution,medical_treatment,ethnicity,ethnicity_other,other_behavior,entry_behavior,were_the_charges,type_of_offense,type_of_offense_other,form_version
0,Clyde,L,Pady,,80.0,Male,2024-10-01,SEXUAL ASSLT/RAPE OF CHILD,SEXUAL ASSLT/RAPE OF CHILD,SEXUAL ASSLT/RAPE OF CHILD,3/1/2005 0:00,2005-01-02,PRISON20051001,Texas Department Of Criminal Justice,,,,,,,5/30/1995 0:00,,Penitentiary,TDCJ,,Tenessee Colony,,,,,,Natural Causes/Illness,Prostate/Lung Cancer,,,,Developed condition after admission,No,Yes,Anglo or White,,,,Convicted,child,,V_BJS
1,Winston,,Wilkins,,61.0,Male,1943-09-27,"STOLEN VEHICLE, UNAUTHORIZED USE",0,0,3/1/2005 0:00,2005-01-02,PRISON20051002,Texas Department Of Criminal Justice,,,,,,,9/8/2003 0:00,,Penitentiary,TDCJ,,Galveston,,,,,,Natural Causes/Illness,Lung Cancer,,,,Pre-existing medical condition,No,Yes,Anglo or White,,,,Convicted,property,,V_BJS
2,Steven,,Fleming,,37.0,Male,1967-10-30,SEXUAL ASSLT/RAPE AGG/SEX ABUSE,0,0,3/1/2005 0:00,2005-01-04,PRISON20051004,Texas Department Of Criminal Justice,,,,,,,10/7/1988 0:00,,Penitentiary,TDCJ,,Livingston,,,,,,Natural Causes/Illness,,,,,Pre-existing medical condition,No,Yes,Black or African American,,,,Convicted,violent,,V_BJS
3,James,B,Walker,,61.0,Male,1943-08-27,AGG SEXUAL ASSLT,AGG KIDNAPPING,0,3/1/2005 0:00,2005-01-04,PRISON20051005,Texas Department Of Criminal Justice,,,,,,,4/23/2004 0:00,,Penitentiary,TDCJ,,Tenessee Colony,,,,,,Natural Causes/Illness,Hepatocellular Cancer,,,,Developed condition after admission,No,Yes,Black or African American,,,,Convicted,violent,,V_BJS
4,David,L,Jones,,45.0,Male,1960-05-01,INDECENCY W/ A CHILD SEXUAL CONTACT,0,0,3/1/2005 0:00,2005-01-06,PRISON20051007,Texas Department Of Criminal Justice,,,,,,,11/8/2001 0:00,,Penitentiary,TDCJ,,Abilene,,,,,,Natural Causes/Illness,Cardiomyopathy Heart Failure,,,,Pre-existing medical condition,No,Yes,Anglo or White,,,,Convicted,child,,V_BJS


In [18]:
cdr.shape, old_master.shape

((6408, 58), (7735, 47))

#### There will be a few columns that the old master file doesn't have, which is fine

In [19]:
set(cdr.columns) - set(old_master.columns)

{'census_tract',
 'department_type',
 'exhibit_any_medical_problems',
 'exhibit_any_mental_health_problems',
 'latitude',
 'longitude',
 'make_suicidal_statements',
 'race',
 'specify_other_behavior',
 'version_number',
 'version_type'}

#### However, it shouldn't have any novel columns that the *other* data doesn't have

In [20]:
assert(len(set(old_master.columns) - set(cdr.columns)) == 0)

In [21]:
cdr = pd.concat([cdr, old_master])
cdr.reset_index(inplace=True, drop=True)
cdr.sort_values('form_version', inplace=True)
cdr.head()

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  """Entry point for launching an IPython kernel.


Unnamed: 0,age_at_time_of_death,agency_address,agency_city,agency_county,agency_name,agency_zip,cdr_cdr_name,census_tract,city,county,custody_date_na,date_of_birth,date_time_of_custody_or_incident,death_causer_other,death_date_and_time,death_location,death_location_elsewhere,department_type,entry_behavior,entry_date_time,entry_date_time_n_a,ethnicity,ethnicity_other,exhibit_any_medical_problems,exhibit_any_mental_health_problems,first_name,form_version,last_name,latitude,longitude,make_suicidal_statements,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,middle_name,offense_1,offense_2,offense_3,other_behavior,pre_existing_medical_condition,race,report_date,sex,specific_type_of_custody_facility,specify_other_behavior,street_address,suffix,type_of_custody,type_of_offense,type_of_offense_other,version_number,version_type,were_the_charges,who_caused_the_death
0,44.0,350 N. Guadalupe Street,Seguin,,Seguin Police Department,78155,16-89-C,,Kingsbury,Guadalupe,1.0,1972-01-20,2016-04-26 23:00:00,,2016-04-26 23:00:00,At the crime/arrest scene,,,,NaT,1.0,Anglo,,,,Darrel,V_2005,Machemehl,,,,Other - specify,Drew his weapon on officers and was shot by of...,Firearm,,Multiple center fire rifle wounds,"Yes, results are available",Not Applicable,S.,Aggravated Robbery,Evading Arrest with Vehicle,,1,Not Applicable; cause of death was accidental ...,,2017-03-13 16:47:00,Male,Custody of Law Enforcement Personnel during/fl...,Was seen putting a shotgun in his backpack the...,US 90 at FM 1104,,Police Custody (pre-booking),Violent Crime Against Persons; Serious Crime A...,,1.0,ORIGINAL VERSION,Not filed at time of death,Law enforcement/correctional staff
3437,54.0,"2503 Lake Road, Suite 5",Huntsville,Walker,Texas Department Of Criminal Justice,77340,PA10082P,,Navasota,Grimes,0.0,1955-12-03,2006-10-10 00:00:00,,2010-05-10 06:15:00,At law enforcement facility,,STAGENCY,,2006-10-10 00:00:00,0.0,African-American,,,,John,V_2005,DeJean,,,,Natural Causes/Illness,venous infarct of the small bowel,Not applicable; cause of death was intoxicatio...,,venous infarct of the small bowel,"Yes, results are available",Not Applicable,,Theft,False Representation of a Lawyer,,0,Don't know,,2010-05-11 10:22:00,Male,TDCJ,,2400 Wallace Pack,,Penitentiary,,,1.0,ORIGINAL VERSION,Convicted,"Not applicable; cause of death was suicide, in..."
3436,43.0,"2503 Lake Road, Suite 5",Huntsville,Walker,Texas Department Of Criminal Justice,77340,PA10083P,,Beaumont,Jefferson,0.0,1967-04-17,2008-03-12 00:00:00,,2010-04-27 18:02:00,At law enforcement facility,,STAGENCY,,2009-08-25 00:00:00,0.0,Anglo,,,,Joel,V_2005,Lewis,,,,Accidental injury to self,,Not applicable; cause of death was intoxicatio...,,heroin toxicity,"Yes, results are available",Not Applicable,Thomas,Possession of Controlled Substance,,,0,Don't know,,2010-05-11 10:40:00,Male,TDCJ,,3060 FM 3514,,Penitentiary,,,1.0,ORIGINAL VERSION,Convicted,"Not applicable; cause of death was suicide, in..."
3435,40.0,601 54th Street,Galveston,Galveston,Galveston County Sheriff's Dept.,77550,PA10084CJ,,Galveston,Galveston,0.0,1969-09-28,2009-10-13 06:45:00,,2010-05-10 02:11:00,At law enforcement facility,,SHERIFF,,2009-10-14 23:47:00,0.0,Anglo,,,,Dennis,V_2005,Bradford,,,,Suicide,,"Hanging, strangulation",,Asphyxiation,"Yes, results are available",Not Applicable,Earl,Attempted Capital Murder of a Child,,,0,Not Applicable; cause of death was accidental ...,,2010-05-12 11:09:00,Male,Jail - single cell,,5700 Ave H,,County Jail,,,1.0,ORIGINAL VERSION,Filed,"Not applicable; cause of death was suicide, in..."
3434,60.0,"2503 Lake Road, Suite 5",Huntsville,Walker,Texas Department Of Criminal Justice,77340,PA10085P,,Palestine,Anderson,0.0,1949-11-09,1990-07-18 00:00:00,,2010-04-01 04:54:00,At law enforcement facility,,STAGENCY,,1990-07-18 00:00:00,0.0,African-American,,,,Milton,V_2005,Dixon,,,,Natural Causes/Illness,Adennocarcinoma of duodenum (terminal cancer o...,Not applicable; cause of death was intoxicatio...,,Terminal cancer,"No, evaluation not planned",Yes,W.,Sexual Aslt,,,0,Don't know,,2010-05-13 09:02:00,Male,TDCJ,,1400 FM 3452,,Penitentiary,,,1.0,ORIGINAL VERSION,Convicted,"Not applicable; cause of death was suicide, in..."


### Convert date columns

In [22]:
convert_date_cols(cdr)

Converting column date_of_birth to datetime
Converting column date_time_of_custody_or_incident to datetime
Converting column death_date_and_time to datetime
Converting column entry_date_time to datetime
Replaced 1361 bad values with NaT:
Unique bad values: {'0000-00-00 00:00:00'}
Converting column report_date to datetime


### Upcase string cell contents

In [23]:
upcase_strip_string_cells(cdr)

### Add a `death_date` column (without the death time)

In [24]:
cdr['death_date'] = cdr.death_date_and_time.apply(lambda dt: datetime.date(dt.year, dt.month, dt.day))

In [25]:
cdr_full = cdr.copy()

## De-duplicate
* We de-dup by finding pairs (or sets) of records that have the same key personal information fields (`pii`).
* In theory we should able to de-dup by the CDR record id, but sometimes we'll see multiple records for the same person with different record IDs.

In [26]:
pii_cols = ['first_name', 'last_name', 'date_of_birth', 'death_date']
dup_pii = cdr.duplicated(subset=pii_cols)
dup_pii.name = 'dup_pii'
dup_record_id = cdr.duplicated(subset='cdr_cdr_name')
dup_record_id.name = 'dup_record_id'

In [27]:
pd.crosstab(dup_record_id, dup_pii)

dup_pii,False,True
dup_record_id,Unnamed: 1_level_1,Unnamed: 2_level_1
False,8832,47
True,40,5224


#### As you can see, MOST records with duplicate personal information also use the same record id, but a few do not.

In [28]:
dropped = cdr[dup_pii]
cdr = cdr[~dup_pii]

In [29]:
dup_record_id = cdr.duplicated(subset='cdr_cdr_name', keep=False)  # Note keep=False to mark both in a duplicate pair

In [30]:
print("Removed %d duplicate rows (%s)" % (len(dropped), ['cdr_cdr_name'] + pii_cols))
print("%d remain" % len(cdr))
print("For now, leaving %d questionable records where there exists another record with the same record id but different personal information" %
     dup_record_id.sum())

Removed 5271 duplicate rows (['cdr_cdr_name', 'first_name', 'last_name', 'date_of_birth', 'death_date'])
8872 remain
For now, leaving 80 questionable records where there exists another record with the same record id but different personal information


In [31]:
cdr[dup_record_id].sort_values(['cdr_cdr_name'])[['cdr_cdr_name'] + pii_cols].head(6)

Unnamed: 0,cdr_cdr_name,first_name,last_name,date_of_birth,death_date
5471,18-148-P,PEDRO,ZAMORA,1971-09-23,2018-02-23
5470,18-148-P,BRUCE,DANNER,1971-10-10,2018-02-26
5108,PA05042P,ARTHUR,ETTER,2005-05-05,2005-03-01
6520,PA05042P,ARTHUR,ETTER,2023-03-02,2005-03-01
6695,PA05142CJ,FRED,GREEN,1952-07-14,2005-07-14
5008,PA05142CJ,FRED,GREEN,2005-06-29,2005-07-14


## Standardize and merge race columns -- the 2005 form calls it 'ethnicity', the 2016 'race'

#### Have a look at the values first

In [32]:
cdr.race.value_counts()

ANGLO OR WHITE               508
HISPANIC OR LATINO           363
BLACK OR AFRICAN AMERICAN    299
OTHER                          5
ASIAN OR PACIFIC ISLANDER      2
Name: race, dtype: int64

In [33]:
cdr.ethnicity.value_counts()

ANGLO                               2094
AFRICAN-AMERICAN                    1513
HISPANIC                            1423
ANGLO OR WHITE                      1095
BLACK OR AFRICAN AMERICAN            754
HISPANIC OR LATINO                   691
OTHER/DON'T KNOW                      39
ASIAN                                 34
OTHER                                 22
AMERICAN INDIAN/ALASKA NATIVE         13
MIDDLE EAST                           12
NATIVE HAWAIIAN/PACIFIC ISLANDER       5
Name: ethnicity, dtype: int64

In [34]:
# Collapse the various OTHER-like values
cdr.loc[cdr.ethnicity.fillna('').str.contains('OTHER'), 'ethnicity'] = 'OTHER'

In [35]:
# When choosing the 'Other' ethnicity in the 2005 form version, there
# is a subsequent field to specify. Though clearly some of them are not
# truly 'other' ethnicities. See:
cdr.ethnicity_other.value_counts()

WHITE                  7
UNKNOWN                5
CAUCASIAN              4
0                      3
UNITED KINGDOM         1
BELIZE, BRAZILIAN      1
ARABIAN                1
WHITE NON HISPANIC     1
SUDANESE BLACK         1
ANGLO & MIDDLE EAST    1
CUBAN                  1
Name: ethnicity_other, dtype: int64

In [36]:
# Replace the nonsensical 0 with NaN
cdr.loc[cdr.ethnicity_other.astype(str) == '0', 'ethnicity_other'] = None

In [37]:
# Let's make sure nobody is filling out the "other ethnicity" column when they shouldn't...
cdr[(~(cdr.ethnicity == 'OTHER') & cdr.ethnicity_other.notnull())][['ethnicity', 'ethnicity_other']]

Unnamed: 0,ethnicity,ethnicity_other


In [38]:
# Good. Let's transfer those specified ethnicity_other values into
# the 'ethnicity' column, so we can merge everything at once.
other_eth = (cdr.ethnicity == 'OTHER')
print('Merging %d "ethnicity_other" values into the main "ethnicity" column' % other_eth.sum())
cdr.loc[other_eth, 'ethnicity'] = cdr.ethnicity_other[other_eth]
cdr.drop('ethnicity_other', axis=1, inplace=True)

Merging 61 "ethnicity_other" values into the main "ethnicity" column


In [39]:
# Make a single 'race' column that has merged, simplified values of race or ethnicity.
race_eth_list = []
for race, eth in zip(cdr.race, cdr.ethnicity):
    # Only one of (race, eth) should be set
    assert pd.isnull(race) or pd.isnull(eth)
    if pd.isnull(race):
        if pd.isnull(eth):
            race_eth_list.append(None)
            continue
        x = eth
    else:
        x = race
    race_eth_list.append(x)

In [40]:
cdr['race'] = race_eth_list
cdr.drop('ethnicity', axis=1, inplace=True)
cdr.race.value_counts()

ANGLO                               2094
ANGLO OR WHITE                      1603
AFRICAN-AMERICAN                    1513
HISPANIC                            1423
HISPANIC OR LATINO                  1054
BLACK OR AFRICAN AMERICAN           1053
ASIAN                                 34
AMERICAN INDIAN/ALASKA NATIVE         13
MIDDLE EAST                           12
WHITE                                  7
OTHER                                  5
UNKNOWN                                5
NATIVE HAWAIIAN/PACIFIC ISLANDER       5
CAUCASIAN                              4
ASIAN OR PACIFIC ISLANDER              2
ARABIAN                                1
BELIZE, BRAZILIAN                      1
WHITE NON HISPANIC                     1
SUDANESE BLACK                         1
UNITED KINGDOM                         1
ANGLO & MIDDLE EAST                    1
CUBAN                                  1
Name: race, dtype: int64

### Regularize race and several other columns

In [41]:
race_before = cdr.race.copy()
race_before.name = 'race_before'

In [42]:
standardize_gender_cols(cdr)
standardize_race_cols(cdr)
numericalize_age_cols(cdr)

cdr.race.value_counts()

Numericalizing column age_at_time_of_death


WHITE       3710
BLACK       2567
HISPANIC    2477
OTHER         80
Name: race, dtype: int64

In [43]:
cdr.groupby([race_before, 'race']).size()

race_before                       race    
AFRICAN-AMERICAN                  BLACK       1513
AMERICAN INDIAN/ALASKA NATIVE     OTHER         13
ANGLO                             WHITE       2094
ANGLO & MIDDLE EAST               WHITE          1
ANGLO OR WHITE                    WHITE       1603
ARABIAN                           OTHER          1
ASIAN                             OTHER         34
ASIAN OR PACIFIC ISLANDER         OTHER          2
BELIZE, BRAZILIAN                 OTHER          1
BLACK OR AFRICAN AMERICAN         BLACK       1053
CAUCASIAN                         WHITE          4
CUBAN                             OTHER          1
HISPANIC                          HISPANIC    1423
HISPANIC OR LATINO                HISPANIC    1054
MIDDLE EAST                       OTHER         12
NATIVE HAWAIIAN/PACIFIC ISLANDER  OTHER          5
OTHER                             OTHER          5
SUDANESE BLACK                    BLACK          1
UNITED KINGDOM                    OTHER

## Standardize agency names, and add agency counties

In [44]:
# Standardize agency name (so we can join/compare across datasets)
cdr['agency_name'] = cdr['agency_name'].apply(standardize_agency_name)

# Lookup county name by agency name. If this fails, fall back
# on the county specified in the form, if it exists.
cdr['agency_county'] = cdr['agency_county'].str.upper()
county_lookup = cdr['agency_name'].apply(lambda name: agency_county.get(name, np.nan))
cdr['agency_county'] = county_lookup.fillna(cdr['agency_county'])

# Manually handle one major agency
cdr.loc[cdr['agency_name'] == 'TEXAS DEPT OF CRIMINAL JUSTICE', 'agency_county'] = 'STATE'

In [45]:
# Check that we are only missing counties for a paltry few records now.
cdr[cdr['agency_county'].isnull()]['agency_name'].value_counts()

TEXAS JUVENILE JUSTICE DEPT OFFICE OF INSPECTOR GENERAL     3
HILDAGO CO SHERIFFS OFFICE                                  2
GULF COAST VIOLENT OFFENDERS TASK FORCE                     1
ELLIS CO SHERIFFS OFFICE WAYNE MCCOLLUM DETENTION CENTER    1
BEAUMONT INDEPENDENT SCHOOL DIST POLICE DEPT                1
NORTHSIDE INDEPENDENT SCHOOL DIST POLICE                    1
ECTOR CO SHERIFFS OFC                                       1
SUGARLAND POLICE DEPT                                       1
NACONA POLICE DEPT                                          1
METROPOLITAN TRANSIT AUTH OF HARRIS CO                      1
Name: agency_name, dtype: int64

## FIx death-information related columns

#### The two form versions offer slightly different options(e.g. V_2005 uses 'AT MEDICAL FACILITY' while V_2016 uses 'MEDICAL FACILITY'). Collapse values to stabilize the options across form versions, and roll up rare values into a single 'OTHER' value.

### Fix: `death_location`

In [46]:
pd.crosstab(cdr.death_location, cdr.form_version)

form_version,V_2005,V_2016,V_BJS
death_location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AT LAW ENFORCEMENT FACILITY,1265,0,13
AT MEDICAL FACILITY,3029,0,26
AT THE CRIME/ARREST SCENE,662,0,7
DEAD ON ARRIVAL AT MEDICAL FACILITY,0,27,0
ELSEWHERE,97,0,2
"ELSEWHERE, SPECIFY",1,20,0
EN ROUTE TO BOOKING CENTER/POLICE LOCKUP,2,2,0
EN ROUTE TO MEDICAL FACILITY,54,0,1
LAW ENFORCEMENT FACILITY/BOOKING CENTER,0,190,0
MEDICAL FACILITY,0,780,0


In [47]:
replacements = {
    'AT MEDICAL FACILITY': 'MEDICAL FACILITY',
    'AT LAW ENFORCEMENT FACILITY': 'LAW ENFORCEMENT FACILITY',
    'AT THE CRIME/ARREST SCENE': 'CRIME/ARREST SCENE',
    'SCENE OF INCIDENT': 'CRIME/ARREST SCENE',
    'LAW ENFORCEMENT FACILITY/BOOKING CENTER': 'LAW ENFORCEMENT FACILITY',
    'DEAD ON ARRIVAL AT MEDICAL FACILITY': 'EN ROUTE TO MEDICAL FACILITY',
    'EN ROUTE TO BOOKING CENTER/POLICE LOCKUP': 'EN ROUTE TO LAW ENFORCEMENT FACILITY',
    'ELSEWHERE': OTHER_SPECIFY,
    'ELSEWHERE, SPECIFY': OTHER_SPECIFY,
}
cdr['death_location'] = cdr['death_location'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr.death_location, cdr.form_version)

form_version,V_2005,V_2016,V_BJS
death_location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CRIME/ARREST SCENE,663,158,7
EN ROUTE TO LAW ENFORCEMENT FACILITY,2,2,0
EN ROUTE TO MEDICAL FACILITY,54,27,1
LAW ENFORCEMENT FACILITY,1265,190,13
MEDICAL FACILITY,3029,780,26
"OTHER, SPECIFY",98,20,2


### Fix: `means_of_death`

In [48]:
pd.crosstab(cdr.means_of_death, cdr.form_version)

form_version,V_2005,V_2016,V_BJS
means_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
BATON / BLUNT INSTRUMENT,0,1,0
BLUNT INSTRUMENT,12,0,0
DON'T KNOW,107,0,2
DON\'T KNOW,32,0,0
DRUG OVERDOSE,151,24,0
FIREARM,877,187,5
"HANGING, STRANGULATION",540,92,3
KNIFE / EDGED INSTRUMENT,0,3,0
"KNIFE, CUTTING INSTRUMENT",38,0,0
NOT APPLICABLE,0,0,2


In [49]:
replacements = {
    'NOT APPLICABLE, CAUSE OF DEATH WAS ILLNESS/NATURAL CAUSE': 'NOT APPLICABLE',
    'NOT APPLICABLE; CAUSE OF DEATH WAS INTOXICATION OR ILLNESS/NATURAL CAUSES': 'NOT APPLICABLE',
    'OTHER': OTHER_SPECIFY,
    'KNIFE, CUTTING INSTRUMENT': 'KNIFE / EDGED INSTRUMENT',
    'BLUNT INSTRUMENT': 'BATON / BLUNT INSTRUMENT',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'RIFLE/SHOTGUN': 'FIREARM',
}
cdr['means_of_death'] = cdr['means_of_death'].apply(lambda x:  None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr.means_of_death, cdr.form_version)

form_version,V_2005,V_2016,V_BJS
means_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
BATON / BLUNT INSTRUMENT,12,1,0
DRUG OVERDOSE,151,24,0
FIREARM,877,187,6
"HANGING, STRANGULATION",540,92,3
KNIFE / EDGED INSTRUMENT,38,3,0
NOT APPLICABLE,3120,734,38
"OTHER, SPECIFY",233,30,3
UNKNOWN,139,93,2
VEHICLE ACCIDENT,1,13,0


### It seems officers sometimes misuse the 'other' option, selecting it when another category is more appropriate:

In [50]:
cdr[cdr.means_of_death == OTHER_SPECIFY]['means_of_death_other'].value_counts().head()

BLUNT FORCE TRAUMA RECEIVED DURING A MOTOR VEHICLE ACCIDENT    8
MOTOR VEHICLE ACCIDENT                                         8
VEHICLE                                                        6
FALL                                                           6
DROWNING                                                       6
Name: means_of_death_other, dtype: int64

### Let's roll all rare and 'unknown'-type values into the 'other' category. 'VEHICLE ACCIDENT' in particular is new in the 2016 form, only indicated as 'other' in older forms.

In [51]:
other_values = ['UNKNOWN', 'VEHICLE ACCIDENT', 'KNIFE / EDGED INSTRUMENT', 'BATON / BLUNT INSTRUMENT']
indices = cdr['means_of_death'].isin(other_values)
cdr.loc[indices, 'means_of_death_other'] = cdr.loc[indices, 'means_of_death']
cdr.loc[indices, 'means_of_death'] = OTHER_SPECIFY

pd.crosstab(cdr.means_of_death, cdr.form_version)

form_version,V_2005,V_2016,V_BJS
means_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
DRUG OVERDOSE,151,24,0
FIREARM,877,187,6
"HANGING, STRANGULATION",540,92,3
NOT APPLICABLE,3120,734,38
"OTHER, SPECIFY",423,140,5


### Fix: `manner_of_death` and  `manner_of_death_description`

In [52]:
pd.crosstab(cdr.form_version, cdr.manner_of_death).T

form_version,V_2005,V_2016,V_BJS
manner_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ACCIDENTAL,3,34,0
ACCIDENTAL INJURY CAUSED BY OTHERS,20,0,4
ACCIDENTAL INJURY TO SELF,155,0,2
ALCOHOL/DRUG INTOXICATION,288,6,0
COULD NOT BE DETERMINED,3,7,0
HOMICIDE (INCLUDES JUSTIFIABLE HOMICIDE),15,142,0
HOMICIDE BY LAW ENFORCEMENT/CORRECTIONAL STAFF,0,0,160
JUSTIFIABLE HOMICIDE,668,0,0
NATURAL,18,667,0
NATURAL CAUSES/ILLNESS,2856,0,2396


In [53]:
pd.crosstab(cdr.manner_of_death, cdr.manner_of_death_description.notnull())

manner_of_death_description,False,True
manner_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1
ACCIDENTAL,34,3
ACCIDENTAL INJURY CAUSED BY OTHERS,24,0
ACCIDENTAL INJURY TO SELF,157,0
ALCOHOL/DRUG INTOXICATION,294,0
COULD NOT BE DETERMINED,9,1
HOMICIDE (INCLUDES JUSTIFIABLE HOMICIDE),156,1
HOMICIDE BY LAW ENFORCEMENT/CORRECTIONAL STAFF,159,1
JUSTIFIABLE HOMICIDE,668,0
NATURAL,668,17
NATURAL CAUSES/ILLNESS,62,5190


In [54]:
replacements = {
    'NATURAL': 'NATURAL CAUSES/ILLNESS',
    'JUSTIFIABLE HOMICIDE': 'HOMICIDE',
    'HOMICIDE BY LAW ENFORCEMENT/CORRECTIONAL STAFF': 'HOMICIDE',
    'OTHER HOMICIDE': 'HOMICIDE',
    'HOMICIDE (INCLUDES JUSTIFIABLE HOMICIDE)': 'HOMICIDE',
    'ACCIDENTAL INJURY CAUSED BY OTHERS': 'ACCIDENTAL',
    'ACCIDENTAL INJURY TO SELF': 'ACCIDENTAL',
    'OTHER': OTHER_SPECIFY,
    'OTHER - SPECIFY': OTHER_SPECIFY,
}
cdr['manner_of_death'] = cdr['manner_of_death'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))

# In past versions, "pending autopsy results" was not an option, and reports had "other"
# checked with some mention of pending autopsy in the free field. We emulate this here
# to preserve consistency across form versions.
other_values = ['PENDING AUTOPSY RESULTS', 'COULD NOT BE DETERMINED']
indices = cdr['manner_of_death'].isin(other_values)
cdr.loc[indices, 'manner_of_death_description'] = cdr.loc[indices, 'manner_of_death']
cdr.loc[indices, 'manner_of_death'] = OTHER_SPECIFY

pd.crosstab(cdr.form_version, cdr.manner_of_death).T

form_version,V_2005,V_2016,V_BJS
manner_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ACCIDENTAL,178,34,6
ALCOHOL/DRUG INTOXICATION,288,6,0
HOMICIDE,799,142,160
NATURAL CAUSES/ILLNESS,2874,667,2396
"OTHER, SPECIFY",140,203,6
SUICIDE,832,125,16


### A few suicides by hanging have the wrong `means_of_death`

In [55]:
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death != 'HANGING, STRANGULATION')]
frame = frame[(frame.medical_cause_of_death.fillna('').str.contains('HANGING')) |
              frame.manner_of_death_description.fillna('').str.contains('HANGING')]
print(len(frame))
frame[['medical_cause_of_death', 'means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description']].head()

43


Unnamed: 0,medical_cause_of_death,means_of_death,means_of_death_other,manner_of_death,manner_of_death_description
3433,HANGING,NOT APPLICABLE,,SUICIDE,
3388,ASPHYXIATION BY HANGING,NOT APPLICABLE,,SUICIDE,
3411,ASPHYXIA DUE TO HANGING,NOT APPLICABLE,,SUICIDE,
3463,ASPHYXIA DUE TO HANGING BY NECK,NOT APPLICABLE,,SUICIDE,
3297,HANGING,NOT APPLICABLE,,SUICIDE,


In [56]:
cdr.loc[frame.index, 'means_of_death'] = 'HANGING, STRANGULATION'

# Be sure we got them all
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death != 'HANGING, STRANGULATION')]
frame = frame[frame.medical_cause_of_death.fillna('').str.contains('HANGING')]
assert len(frame) == 0

### For a few suicides, the `manner_of_death` is 'NOT APPLICABLE', which makes no sense. We need to change these to OTHER, though (as shown below) more is going on

In [57]:
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death == 'NOT APPLICABLE')]
print(len(frame))
frame[['means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description', 'medical_cause_of_death']].head()

17


Unnamed: 0,means_of_death,means_of_death_other,manner_of_death,manner_of_death_description,medical_cause_of_death
3404,NOT APPLICABLE,,SUICIDE,,NORTRIPTHLINE OVERDOSE
3530,NOT APPLICABLE,,SUICIDE,,ASPHYXIATION
3264,NOT APPLICABLE,,SUICIDE,,EXSANGUINATION DUE TO LEFT INTERNAL JUGULAR VE...
3354,NOT APPLICABLE,,SUICIDE,,PERFORATING GUNSHOT WOUND OF LEFT CHEST
3856,NOT APPLICABLE,,SUICIDE,,TOXIC EFFECT OF LETHAL LEVELS OF QUETIAPINE


In [58]:
cdr.loc[frame.index, 'means_of_death'] = OTHER_SPECIFY
assert len(cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death == 'NOT APPLICABLE')]) == 0

### Fix: `pre_existing_medical_condition` 

In [59]:
pd.crosstab(cdr.form_version, cdr.pre_existing_medical_condition).T

form_version,V_2005,V_2016,V_BJS
pre_existing_medical_condition,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
COULD NOT BE DETERMINED,0,647,0
DECEASED DEVELOPED CONDITION AFTER ADMISSION,0,55,0
DEVELOPED CONDITION AFTER ADMISSION,154,0,167
DON'T KNOW,1141,0,301
DON\'T KNOW,834,0,1
"NOT APPLICABLE; CAUSE OF DEATH WAS ACCIDENTAL INJURY, INTOXICATION, SUICIDE OR HOMICIDE",2045,352,32
PRE-EXISTING MEDICAL CONDITION,937,123,1927


In [60]:
replacements = {
    'DECEASED DEVELOPED CONDITION AFTER ADMISSION': 'DEVELOPED CONDITION AFTER ADMISSION',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'NOT APPLICABLE; CAUSE OF DEATH WAS ACCIDENTAL INJURY, INTOXICATION, SUICIDE OR HOMICIDE': 'NOT APPLICABLE',
    'COULD NOT BE DETERMINED': 'UNKNOWN',
    'PRE-EXISTING MEDICAL CONDITION': 'YES',
}
cdr['pre_existing_medical_condition'] = cdr['pre_existing_medical_condition'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr.form_version, cdr.pre_existing_medical_condition).T

form_version,V_2005,V_2016,V_BJS
pre_existing_medical_condition,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
DEVELOPED CONDITION AFTER ADMISSION,154,55,167
NOT APPLICABLE,2045,352,32
UNKNOWN,1975,647,302
YES,937,123,1927


### Fix: `who_caused_the_death` 

NOTE: This question is framed as follows:
* 2005 form: "If the death was an accident or homicide, who caused the death?"
* 2016 form: "If the death was an accident, homicide **or suicide**, who caused the death?" (emphasis added)

Thus, we need to:
1. Collapse near-identical values from different forms, similar to the other areas here.
1. Remove suicides from the 2016 responses, as they skew the data (see below). While we're add it, change ANY entries that are not of type homicide/suicide to have "NOT APPLICABLE" as the value.

#### Collapse near-identical values from different forms, similar to the other areas here.

In [61]:
pd.crosstab(cdr.who_caused_the_death, cdr.death_date_and_time.dt.year)

death_date_and_time,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
who_caused_the_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
DECEASED,11,13,14,13,12,14,5,13,19,12,13,14,0,0
DECEDENT,0,0,0,0,0,0,0,0,0,0,0,8,112,59
DON'T KNOW,3,4,6,15,23,24,21,11,8,5,1,0,0,0
DON\'T KNOW,0,0,0,0,0,0,0,0,0,2,18,13,0,0
LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,0,0,0,0,0,0,0,0,1,2,2,14,78,51
LAW ENFORCEMENT/CORRECTIONAL STAFF,41,45,66,65,74,67,58,97,90,92,114,69,0,0
NOT APPLICABLE,0,0,0,0,0,0,0,0,0,0,0,60,500,256
"NOT APPLICABLE; CAUSE OF DEATH WAS SUICIDE, INTOXICATION OR ILLNESS/NATURAL CAUSES",232,262,290,255,227,217,230,238,566,524,550,443,0,0
OTHER CIVILIAN(S),0,0,0,0,0,0,0,0,0,0,0,0,1,1
OTHER DETAINEE(S),0,0,0,0,0,0,0,0,0,0,0,2,5,2


In [62]:
replacements = {
    'DECEASED': 'DECEDENT',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'LAW ENFORCEMENT/CORRECTIONAL STAFF': 'LAW ENFORCEMENT/CORRECTIONAL PERSONNEL',
    'NOT APPLICABLE; CAUSE OF DEATH WAS SUICIDE, INTOXICATION OR ILLNESS/NATURAL CAUSES': 'NOT APPLICABLE',
    'OTHER DETAINEES': 'OTHER DETAINEE(S)',
    'OTHER PERSONS': 'OTHER CIVILIAN(S)',
    'ACCIDENTAL INJURY TO SELF': 'ACCIDENTAL',
    'UNKNOWN PERSON(S) CAUSED THE INJURY': 'UNKNOWN',
    'UNKNOWN WHETHER DECEDENT SUSTAINED A FATAL INJURY': 'UNKNOWN',
}
cdr['who_caused_the_death'] = cdr['who_caused_the_death'].apply(lambda x:  None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['who_caused_the_death'], cdr.form_version)

form_version,V_2005,V_2016,V_BJS
who_caused_the_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
DECEDENT,153,179,0
LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,719,147,160
NOT APPLICABLE,3994,816,40
OTHER CIVILIAN(S),46,2,0
OTHER DETAINEE(S),48,9,3
UNKNOWN,151,24,3


#### Remove suicides from the 2016 responses to 'who caused the death' question, as in previous years 'who caused the death' was only asked for homicides and accidents. (see below)

In [63]:
pd.crosstab(cdr.who_caused_the_death, cdr.manner_of_death)

manner_of_death,ACCIDENTAL,ALCOHOL/DRUG INTOXICATION,HOMICIDE,NATURAL CAUSES/ILLNESS,"OTHER, SPECIFY",SUICIDE
who_caused_the_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
DECEDENT,112,18,4,3,45,150
LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,2,2,987,0,35,0
NOT APPLICABLE,58,250,22,3521,190,809
OTHER CIVILIAN(S),12,1,20,2,12,1
OTHER DETAINEE(S),8,0,49,0,3,0
UNKNOWN,23,23,19,48,62,3


In [64]:
cdr.loc[~cdr.manner_of_death.isin(['HOMICIDE', 'ACCIDENTAL', OTHER_SPECIFY]), 'who_caused_the_death'] = 'NOT APPLICABLE'
pd.crosstab(cdr.who_caused_the_death, cdr.manner_of_death)

manner_of_death,ACCIDENTAL,ALCOHOL/DRUG INTOXICATION,HOMICIDE,NATURAL CAUSES/ILLNESS,"OTHER, SPECIFY",SUICIDE
who_caused_the_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
DECEDENT,112,0,4,0,45,0
LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,2,0,987,0,35,0
NOT APPLICABLE,58,294,22,5937,190,973
OTHER CIVILIAN(S),12,0,20,0,12,0
OTHER DETAINEE(S),8,0,49,0,3,0
UNKNOWN,23,0,19,0,62,0


### Fix: `were_the_charges` 

In [65]:
pd.crosstab(cdr['were_the_charges'], cdr.form_version)

form_version,V_2005,V_2016,V_BJS
were_the_charges,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A PROBATION/PAROLE VIOLATION,0,17,0
CONVICTED,2840,795,2411
FILED,831,105,5
NOT FILED AT TIME OF DEATH,1332,260,14
PROBATION/PAROLE,108,0,0


In [66]:
replacements = {
    'PROBATION/PAROLE': 'PROBATION/PAROLE VIOLATION',
    'A PROBATION/PAROLE VIOLATION': 'PROBATION/PAROLE VIOLATION',
}
cdr['were_the_charges'] = cdr['were_the_charges'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['were_the_charges'], cdr.form_version)

form_version,V_2005,V_2016,V_BJS
were_the_charges,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CONVICTED,2840,795,2411
FILED,831,105,5
NOT FILED AT TIME OF DEATH,1332,260,14
PROBATION/PAROLE VIOLATION,108,17,0


### Fix `type_of_custody`

In [67]:
pd.crosstab(cdr['type_of_custody'], cdr.form_version)

form_version,V_2005,V_2016,V_BJS
type_of_custody,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
COUNTY JAIL,1035,131,12
MUNICIPAL JAIL,142,14,0
PENITENTIARY,2642,777,2409
POLICE CUSTODY (PRE-BOOKING),1276,120,163
PRE-CUSTODIAL USE OF FORCE,0,115,0
PRIVATE CORRECTIONAL FACILITY,16,0,0
PRIVATE FACILITY,0,20,0


In [68]:
replacements = {
    'PRE-CUSTODIAL USE OF FORCE': 'POLICE CUSTODY (PRE-BOOKING)',
    'PRIVATE CORRECTIONAL FACILITY': 'PRIVATE FACILITY',
    'COUNTY JAIL': 'JAIL',
    'MUNICIPAL JAIL': 'JAIL',
    'PENITENTIARY': 'PRISON',
}
cdr['type_of_custody'] = cdr['type_of_custody'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['type_of_custody'], cdr.form_version)

form_version,V_2005,V_2016,V_BJS
type_of_custody,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
JAIL,1177,145,12
POLICE CUSTODY (PRE-BOOKING),1276,235,163
PRISON,2642,777,2409
PRIVATE FACILITY,16,20,0


### Fix `specific_type_of_custody_facility`

In [69]:
pd.crosstab(cdr.specific_type_of_custody_facility, cdr.form_version)

form_version,V_2005,V_2016,V_BJS
specific_type_of_custody_facility,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CORRECTIONAL/REHABILITATION FACILITY,33,1,1
CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST,6,134,0
CUSTODY OF LAW ENFORCEMENT PERSONNEL SUBSEQUENT TO ARREST,2,42,0
CUSTODY OF PEACE OFFICER DURING/FLEEING ARREST,940,0,7
CUSTODY OF PEACE OFFICER SUBSEQUENT TO ARREST,242,0,3
HALFWAY HOUSE/RESTITUTION CENTER,12,0,0
HOSPITAL/INFIRMARY,304,51,3
JAIL - DAY ROOM/RECREATION AREA,25,4,1
JAIL - DETOX CELL,60,6,0
JAIL - HOLDING CELL,124,15,4


In [70]:
replacements = {
    'TDCJ, SPECIFY': 'TDCJ',
    'CUSTODY OF PEACE OFFICER DURING/FLEEING ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST',
    'CUSTODY OF PEACE OFFICER SUBSEQUENT TO ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST',
    'CUSTODY OF LAW ENFORCEMENT PERSONNEL SUBSEQUENT TO ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST',
    'TEXAS-JUVENILE JUSTICE DEPARTMENT - FACILITY/DETENTION CENTER, SPECIFY': 'OTHER',
    'TJPC': 'OTHER',
    'TYC': 'OTHER',
    'HALFWAY HOUSE/RESTITUTION CENTER': 'OTHER',
    'CORRECTIONAL/REHABILITATION FACILITY': 'OTHER',
    'NON-LAW ENFORCEMENT DETOX FACILITY': 'OTHER',
}
cdr['specific_type_of_custody_facility'] = cdr['specific_type_of_custody_facility'].apply(
    lambda x: x if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['specific_type_of_custody_facility'], cdr.form_version)

form_version,V_2005,V_2016,V_BJS
specific_type_of_custody_facility,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST,244,42,3
CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST,946,134,7
HOSPITAL/INFIRMARY,304,51,3
JAIL - DAY ROOM/RECREATION AREA,25,4,1
JAIL - DETOX CELL,60,6,0
JAIL - HOLDING CELL,124,15,4
JAIL - MULTIPLE OCCUPANCY CELL,380,38,2
JAIL - SINGLE CELL,357,50,0
OTHER,51,4,1
TDCJ,2620,775,2409


### Glance at the types of custody crossed with facility

In [71]:
pd.crosstab(cdr.specific_type_of_custody_facility, cdr.type_of_custody)

type_of_custody,JAIL,POLICE CUSTODY (PRE-BOOKING),PRISON,PRIVATE FACILITY
specific_type_of_custody_facility,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST,16,273,0,0
CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST,3,1084,0,0
HOSPITAL/INFIRMARY,243,94,3,18
JAIL - DAY ROOM/RECREATION AREA,30,0,0,0
JAIL - DETOX CELL,65,1,0,0
JAIL - HOLDING CELL,138,5,0,0
JAIL - MULTIPLE OCCUPANCY CELL,417,1,0,2
JAIL - SINGLE CELL,401,0,3,3
OTHER,18,3,23,12
TDCJ,2,2,5799,1


#### Identify and drop a range of unnecessary columns

In [72]:
cdr.groupby([cdr.entry_date_time.isnull(), cdr.entry_date_time_n_a]).size().unstack()

entry_date_time_n_a,0.0,1.0
entry_date_time,Unnamed: 1_level_1,Unnamed: 2_level_1
False,4504.0,
True,,1833.0


In [73]:
pd.crosstab(cdr.custody_date_na, cdr['date_time_of_custody_or_incident'].isnull())

date_time_of_custody_or_incident,False
custody_date_na,Unnamed: 1_level_1
0.0,5113
1.0,1


In [74]:
cdr.drop(['entry_date_time_n_a', 'custody_date_na'], axis=1, inplace=True)

In [75]:
pd.crosstab(cdr.type_of_offense.notnull(), cdr.form_version)

form_version,V_2005,V_2016,V_BJS
type_of_offense,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,5101,1,162
True,10,1176,2422


In [76]:
pd.crosstab(cdr.type_of_offense_other.notnull(), cdr.form_version)

form_version,V_2005,V_2016,V_BJS
type_of_offense_other,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,4351,1034,2575
True,760,143,9


In [77]:
cdr.drop(['type_of_offense', 'type_of_offense_other'], axis=1, inplace=True)

In [78]:
pd.crosstab(cdr.department_type.notnull(), cdr.form_version)

form_version,V_2005,V_2016,V_BJS
department_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,1,1177,2584
True,5110,0,0


In [79]:
cdr.drop('department_type', axis=1, inplace=True)

In [80]:
cdr['other_behavior'].value_counts()

0.0    4937
1.0     174
Name: other_behavior, dtype: int64

In [81]:
pd.crosstab(cdr['other_behavior'], cdr['specify_other_behavior'].notnull())

specify_other_behavior,False,True
other_behavior,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,4937,0
1.0,0,174


In [82]:
cdr['other_behavior'] = cdr['specify_other_behavior']
cdr.drop('specify_other_behavior', axis=1, inplace=True)

## Add new columns

### Column for the time (in days) between incarceration (or incident) and death
Note: for a few of these, the death date/time is before the incarceration/incident date. If they are only one day apart, we just call it 0 and assume it was a minor error. If they are more than a day apart, clearly there was a larger error, so we use a NaN value.

In [83]:
cdr.date_time_of_custody_or_incident.dtype

dtype('<M8[ns]')

In [84]:
def get_days(dt):
    if dt.days < -1:
        return None
    elif dt.days == -1:
        return 0
    else:
        return dt.days

delta = cdr.death_date_and_time - cdr.date_time_of_custody_or_incident
print("For %d records with death date before custoday date, setting the days_from_custody_to_death to NaN" % (delta.dt.days < -1).sum())
cdr['days_from_custody_to_death'] = delta.apply(get_days)

For 6 records with death date before custoday date, setting the days_from_custody_to_death to NaN


In [85]:
cdr['name_full'] = ''
for col in ['first_name', 'middle_name', 'last_name', 'suffix']:
    cdr['name_full'] = cdr['name_full'] + ' ' + cdr[col].fillna('')
cdr['name_full'] = cdr['name_full'].apply(lambda s: ' '.join(s.strip().split()))
cdr.loc[cdr['name_full'] == '', 'name_full'] = np.nan

## Tweak bookkeping columns

In [86]:
cdr['num_revisions'] = cdr['version_number'] - 1
cdr.drop(['version_type', 'version_number'], axis=1, inplace=True)

## Rename and reorder columns sensibly

In [87]:
col_renames = {
    'first_name': 'name_first',
    'middle_name': 'name_middle',
    'last_name': 'name_last',
    'suffix': 'name_suffix',
    'cdr_cdr_name': 'record_id',
    'death_causer_other': 'who_caused_death_in_homicide_or_accident_other',
    'who_caused_the_death': 'who_caused_death_in_homicide_or_accident',
    'death_location': 'death_location_type',
    'death_location_elsewhere': 'death_location_type_other',
    'city': 'death_location_city',
    'county': 'death_location_county',
    'latitude': 'death_location_latitude',
    'longitude': 'death_location_longitude',
    'census_tract': 'death_location_census_tract',
    'street_address': 'death_location_street_address',
    'entry_date_time': 'facility_entry_date_time',
    'pre_existing_medical_condition': 'death_from_pre_existing_medical_condition',
}

In [88]:
cdr.rename(columns=col_renames, inplace=True)

In [89]:
after = cdr[[
    # Record indexing columns
    'record_id',
    'num_revisions',
    'form_version',
    'report_date',
    'date_time_of_custody_or_incident',

    # Deceased personal information, demographics
    'name_first',
    'name_last',
    'name_middle',
    'name_suffix',
    'name_full',
    'date_of_birth',
    'age_at_time_of_death',
    'sex',
    'race',

    # Death event information
    'death_date',
    'death_date_and_time',
    'death_location_county',
    'death_location_city',
    'death_location_street_address',
    'death_location_latitude',
    'death_location_longitude',
    'death_location_census_tract',
    'death_location_type',
    'death_location_type_other',
    'death_from_pre_existing_medical_condition',
    'manner_of_death',
    'manner_of_death_description',
    'means_of_death',
    'means_of_death_other',
    'medical_cause_of_death',
    'medical_examinor_coroner_evalution',
    'medical_treatment',
    'days_from_custody_to_death',
    'who_caused_death_in_homicide_or_accident',
    'who_caused_death_in_homicide_or_accident_other',

    # Criminal information on deceased
    'offense_1',
    'offense_2',
    'offense_3',
    'were_the_charges',

    # Facility and agency information
    'facility_entry_date_time',
    'type_of_custody',
    'specific_type_of_custody_facility',
    'agency_address',
    'agency_city',
    'agency_county',
    'agency_name',
    'agency_zip',
    
    # Deceased behavior upon entry or custody
    'entry_behavior',
    'other_behavior',
    'exhibit_any_medical_problems',
    'exhibit_any_mental_health_problems',
    'make_suicidal_statements',
]]
assert cdr.shape == after.shape
cdr = after

In [90]:
cdr.sample(10)

Unnamed: 0,record_id,num_revisions,form_version,report_date,date_time_of_custody_or_incident,name_first,name_last,name_middle,name_suffix,name_full,date_of_birth,age_at_time_of_death,sex,race,death_date,death_date_and_time,death_location_county,death_location_city,death_location_street_address,death_location_latitude,death_location_longitude,death_location_census_tract,death_location_type,death_location_type_other,death_from_pre_existing_medical_condition,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,days_from_custody_to_death,who_caused_death_in_homicide_or_accident,who_caused_death_in_homicide_or_accident_other,offense_1,offense_2,offense_3,were_the_charges,facility_entry_date_time,type_of_custody,specific_type_of_custody_facility,agency_address,agency_city,agency_county,agency_name,agency_zip,entry_behavior,other_behavior,exhibit_any_medical_problems,exhibit_any_mental_health_problems,make_suicidal_statements
1452,PA14465P,0.0,V_2005,2014-10-08 10:41:00,1996-10-01 00:00:00,RICKEY,JONES,,,RICKEY JONES,1960-06-11,54.0,M,BLACK,2014-09-15,2014-09-15 23:00:00,GALVESTON,GALVESTON,809 HARBORSIDE DRIVE,,,,MEDICAL FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,HEMORRHAGIC SHOCK,NOT APPLICABLE,,HEMORRHAGIC SHOCK SECONDARY TO AN UNIDENTIFIED...,"YES, RESULTS ARE AVAILABLE",YES,6558.0,NOT APPLICABLE,,AGGRAVATED SEXUAL ASSAULT,,,CONVICTED,1996-10-01,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340.0,,,,,
398,PA16196CJ,0.0,V_2005,2016-05-16 14:34:00,2016-02-10 00:44:00,DONOVAN,PHILLIPS,,,DONOVAN PHILLIPS,1974-11-16,41.0,M,BLACK,2016-05-15,2016-05-15 22:40:00,HARRIS,HOUSTON,1200 BAKER STREET,,,,MEDICAL FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,CARDIOVASCULAR DISEASE,NOT APPLICABLE,,ATHEROSCLEROTIC AND HYPERTENSIVE CARDIOVASCULA...,"YES, RESULTS ARE AVAILABLE",NOT APPLICABLE,95.0,NOT APPLICABLE,,POSS W/INTENT DELIVER,,,FILED,NaT,JAIL,JAIL - MULTIPLE OCCUPANCY CELL,1200 BAKER STREET,HOUSTON,HARRIS,HARRIS CO SHERIFFS OFFICE,77002.0,,,,,
11245,PRISON2012439,,V_BJS,2012-12-01 00:00:00,2011-08-16 00:00:00,RACHENDA,SAVAGE,K,,RACHENDA K SAVAGE,1975-10-14,37.0,F,BLACK,2012-11-01,2012-11-01 00:00:00,,GALVESTON,,,,,,,YES,NATURAL CAUSES/ILLNESS,"LUPUS, CEREBRITIS",,,,NO,YES,443.0,NOT APPLICABLE,,AGG ASSLT CAUSES SBI,0,0,CONVICTED,NaT,PRISON,TDCJ,,,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,,,,,,
7480,PRISON20064033,,V_BJS,2006-12-01 00:00:00,2006-04-07 00:00:00,ROSALIE,GONZALES,,,ROSALIE GONZALES,1961-09-13,45.0,F,HISPANIC,2006-10-29,2006-10-29 00:00:00,,GALVESTON,,,,,,,YES,NATURAL CAUSES/ILLNESS,CRYPTOCOCCAL MENINGITIS,,,,NO,YES,205.0,NOT APPLICABLE,,NOT SPECIFIED,0,0,CONVICTED,NaT,PRISON,TDCJ,,,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,,,,,,
6236,17-84-P,0.0,V_2016,2017-02-28 12:58:00,2012-08-03 00:00:00,ALBERTO,RIVAS,M,,ALBERTO M RIVAS,1958-11-21,58.0,M,HISPANIC,2017-02-17,2017-02-17 11:15:00,POTTER,AMARILLO,9601 SPUR 591,,,,LAW ENFORCEMENT FACILITY,,DEVELOPED CONDITION AFTER ADMISSION,NATURAL CAUSES/ILLNESS,,NOT APPLICABLE,,END STAGE LIVER DISEASE,"YES, RESULTS PENDING",YES,1659.0,NOT APPLICABLE,,DRIVING WHILE INTOXICATED,BURGLARY OF A BUILDING,,CONVICTED,2012-08-03,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340.0,,,UNKNOWN,UNKNOWN,UNKNOWN
11125,PRISON2012387,,V_BJS,2012-09-01 00:00:00,2003-05-02 00:00:00,A.,GOMEZ,,,A. GOMEZ,1954-12-22,58.0,M,HISPANIC,2012-09-12,2012-09-12 00:00:00,,NAVASOTA,,,,,,,YES,NATURAL CAUSES/ILLNESS,RUPTURED ESOPHAGEAL DUE TO CHRONIC HEPATITIS C,,,,YES,YES,3421.0,NOT APPLICABLE,,POSS MARIJ <= 5LBS > 4OZ,POSS CS PG 1 >=4G < 200G,POSS CS PG 1 >= 1G < 4G,CONVICTED,NaT,PRISON,TDCJ,,,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,,,,,,
3379,PA10143P,0.0,V_2005,2010-07-23 14:42:00,2005-10-02 00:00:00,ERIC,BRACKEN,,,ERIC BRACKEN,1961-01-05,49.0,M,WHITE,2010-07-19,2010-07-19 18:05:00,LUBBOCK,LUBBOCK,8602 PEACH STREET,,,,MEDICAL FACILITY,,YES,NATURAL CAUSES/ILLNESS,CORONARY ATHEROSCLEROSIS WITH ACUTE THROMBOSIS...,NOT APPLICABLE,,CORONARY ATHEROSCLEROSIS WITH ACUTE THROMBOSIS...,"YES, RESULTS ARE AVAILABLE",YES,1751.0,NOT APPLICABLE,,INDECENCY WITH A CHILD/CONTACT,,,CONVICTED,2005-12-05,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340.0,,,,,
10613,PRISON2011453,,V_BJS,2011-12-01 00:00:00,1998-04-02 00:00:00,THOMAS,LOVATO,,,THOMAS LOVATO,1949-02-01,63.0,M,WHITE,2011-11-22,2011-11-22 00:00:00,,LUBBOCK,,,,,,,YES,NATURAL CAUSES/ILLNESS,RESPIRATORY FAILURE - HEART DISEASE,,,,NO,YES,4982.0,NOT APPLICABLE,,AGG SEXUAL ASSLT,0,0,CONVICTED,NaT,PRISON,TDCJ,,,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,,,,,,
1891,PA14023P,0.0,V_2005,2014-01-30 09:40:00,2001-10-18 00:00:00,DUANE,MCDANIEL,,,DUANE MCDANIEL,1957-10-16,56.0,M,WHITE,2014-01-04,2014-01-04 23:00:00,ANDERSON,PALESTINE,1400 FM 3452,,,,MEDICAL FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,CARDIAC ARREST,NOT APPLICABLE,,CARDIAC ARREST,"NO, EVALUATION NOT PLANNED",YES,4461.0,NOT APPLICABLE,,MANSLAUGHTER,,,CONVICTED,2001-10-18,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340.0,,,,,
3819,PA09019P,0.0,V_2005,2009-02-12 10:14:00,1992-11-23 00:00:00,WILLARD,DAVIS,,,WILLARD DAVIS,1958-09-28,50.0,M,BLACK,2009-02-09,2009-02-09 05:11:00,BROWN,ROSHARON,1300 FM 655,,,,MEDICAL FACILITY,,UNKNOWN,"OTHER, SPECIFY",SUDDEN CARDIAC ARREST DUE TO HYPERTENSIVE CARD...,NOT APPLICABLE,,SUDDEN CARDIAC ARREST DUE TO HYPERTENSIVE CARD...,"YES, RESULTS ARE AVAILABLE",NO,5922.0,NOT APPLICABLE,,BURGLARY OF HABITATION,,,CONVICTED,1992-11-23,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340.0,,,,,


In [91]:
show_notnull_chart(cdr)

Unnamed: 0,age_at_time_of_death,agency_address,agency_city,agency_county,agency_name,agency_zip,date_of_birth,date_time_of_custody_or_incident,days_from_custody_to_death,death_date,death_date_and_time,death_from_pre_existing_medical_condition,death_location_census_tract,death_location_city,death_location_county,death_location_latitude,death_location_longitude,death_location_street_address,death_location_type,death_location_type_other,entry_behavior,exhibit_any_medical_problems,exhibit_any_mental_health_problems,facility_entry_date_time,form_version,make_suicidal_statements,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,name_first,name_full,name_last,name_middle,name_suffix,num_revisions,offense_1,offense_2,offense_3,other_behavior,race,record_id,report_date,sex,specific_type_of_custody_facility,type_of_custody,were_the_charges,who_caused_death_in_homicide_or_accident,who_caused_death_in_homicide_or_accident_other
V_2005,1.0,1.0,1.0,1.0,1,1.0,1.0,1.0,0.999022,1,1,1.0,0,0.998826,1.0,0,0,0.999804,1.0,0.01937,0.141851,0.0,0.0,0.696537,1,0.0,1,0.591665,1.0,0.0829583,0.999413,1.0,0.999804,1,1,1.0,0.558403,0.0702407,1,0.998826,0.26081,0.0817844,0.0340442,1.0,1,1.0,1,1.0,1,1.0,1.0,0.00919585
V_2016,1.0,1.0,1.0,0.994902,1,1.0,1.0,1.0,0.99915,1,1,1.0,0,0.99915,1.0,0,0,1.0,1.0,0.0195412,0.0,0.885302,0.880204,0.774002,1,0.880204,1,0.175021,1.0,0.122345,0.996602,1.0,1.0,1,1,1.0,0.371283,0.0458794,1,0.994053,0.26763,0.097706,0.0,1.0,1,0.99915,1,0.950722,1,1.0,1.0,0.0
V_BJS,0.993808,0.0201238,0.0201238,0.997291,1,0.0201238,0.934985,0.938854,0.938854,1,1,0.939628,0,0.940402,0.0201238,0,0,0.0201238,0.0189628,0.000773994,0.00116099,0.0,0.0,0.0127709,1,0.0,1,0.905573,0.0201238,0.00193498,0.0201238,0.940402,0.940402,1,1,0.997678,0.625774,0.00270898,0,0.940402,0.912926,0.903638,0.0,0.985294,1,0.940402,1,0.940402,1,0.940402,0.998065,0.0


In [92]:
print(len(cdr))

8872


In [93]:
cdr.form_version.value_counts()

V_2005    5111
V_BJS     2584
V_2016    1177
Name: form_version, dtype: int64

## Write

In [94]:
with dw.open_remote_file(DTW_PROJECT_KEY_CDR, CLEANED_FILENAME) as w:
    print("Writing to data.world:", CLEANED_FILENAME)
    cdr.to_csv(w, index=False)

Writing to data.world: cleaned_custodial_death_reports.csv


---
## TODOs -- problems that still exist in the data and need work
---

### [TODO] Some duplicates still exist, but it's unclear which to drop

In [95]:
dup_record_id = cdr.duplicated(subset='record_id', keep=False)  # Note keep=False to mark both in a duplicate pair
print("Not sure what to do with %d inconsistent duplicates" % dup_record_id.sum())
cdr[dup_record_id].sort_values(['record_id']).head(6)

Not sure what to do with 80 inconsistent duplicates


Unnamed: 0,record_id,num_revisions,form_version,report_date,date_time_of_custody_or_incident,name_first,name_last,name_middle,name_suffix,name_full,date_of_birth,age_at_time_of_death,sex,race,death_date,death_date_and_time,death_location_county,death_location_city,death_location_street_address,death_location_latitude,death_location_longitude,death_location_census_tract,death_location_type,death_location_type_other,death_from_pre_existing_medical_condition,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,days_from_custody_to_death,who_caused_death_in_homicide_or_accident,who_caused_death_in_homicide_or_accident_other,offense_1,offense_2,offense_3,were_the_charges,facility_entry_date_time,type_of_custody,specific_type_of_custody_facility,agency_address,agency_city,agency_county,agency_name,agency_zip,entry_behavior,other_behavior,exhibit_any_medical_problems,exhibit_any_mental_health_problems,make_suicidal_statements
5471,18-148-P,0.0,V_2016,2018-03-16 09:40:00,2011-02-07 00:00:00,PEDRO,ZAMORA,,,PEDRO ZAMORA,1971-09-23,46.0,M,HISPANIC,2018-02-23,2018-02-23 19:02:00,LUBBOCK,LUBBOCK,8602 PEACH AVE,,,,MEDICAL FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,,NOT APPLICABLE,,PANCREATITIS /PNEUMONIA,"NO, EVALUATION NOT PLANNED",YES,2573.0,NOT APPLICABLE,,INDECENCY WITH A CHILD,,,CONVICTED,2011-02-07 00:00:00,PRISON,TDCJ,1012 VETERANS MEMORIAL PARKWAY,HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77320.0,,,YES,UNKNOWN,UNKNOWN
5470,18-148-P,0.0,V_2016,2018-03-16 09:47:00,2017-12-07 00:00:00,BRUCE,DANNER,,,BRUCE DANNER,1971-10-10,46.0,M,WHITE,2018-02-26,2018-02-26 16:15:00,JEFFERSON,BEAUMONT,3295 FM 3514,,,,MEDICAL FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,,NOT APPLICABLE,,ANOXIC ENCEPHALOPATHY,"NO, EVALUATION NOT PLANNED",UNKNOWN,81.0,NOT APPLICABLE,,POSSESSION OF A CONTROLLED SUBSTANCE,,,CONVICTED,2017-12-07 00:00:00,PRISON,TDCJ,1012 VETERANS MEMORIAL PARKWAY,HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77320.0,,,YES,UNKNOWN,UNKNOWN
5108,PA05042P,0.0,V_2005,2005-04-18 13:37:00,2004-09-24 10:00:00,ARTHUR,ETTER,,,ARTHUR ETTER,2005-05-05,-1.0,M,WHITE,2005-03-01,2005-03-01 20:24:00,GALVESTON,GALVESTON,BETO UNIT,,,,MEDICAL FACILITY,,DEVELOPED CONDITION AFTER ADMISSION,NATURAL CAUSES/ILLNESS,PNEUMONIA,NOT APPLICABLE,,RESPIRATORY FAILURE,"NO, EVALUATION NOT PLANNED",YES,158.0,NOT APPLICABLE,,CRIMINAL SOLICITATION OF A MINOR,,,CONVICTED,2004-09-24 10:00:00,PRISON,TDCJ,P.O. BOX 4003,HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,773424003.0,,,,,
6520,PA05042P,,V_BJS,2005-04-18 13:37:00,2004-09-24 10:00:00,ARTHUR,ETTER,,,ARTHUR ETTER,2023-03-02,82.0,M,WHITE,2005-03-01,2005-03-01 00:00:00,GALVESTON,GALVESTON,BETO UNIT,,,,MEDICAL FACILITY,,DEVELOPED CONDITION AFTER ADMISSION,NATURAL CAUSES/ILLNESS,PNEUMONIA,NOT APPLICABLE,,RESPIRATORY FAILURE,"NO, EVALUATION NOT PLANNED",YES,157.0,NOT APPLICABLE,,CRIMINAL SOLICITATION OF A MINOR,,,CONVICTED,2004-09-24 00:00:00,PRISON,TDCJ,P.O. BOX 4003,HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,773424000.0,,,,,
6695,PA05142CJ,,V_BJS,2005-08-11 14:08:00,2005-05-01 13:56:00,FRED,GREEN,,,FRED GREEN,1952-07-14,53.0,M,BLACK,2005-07-14,2005-07-14 00:00:00,DALLAS,DALLAS,133 N. INDUSTRIAL BLVD.,,,,MEDICAL FACILITY,,DEVELOPED CONDITION AFTER ADMISSION,NATURAL CAUSES/ILLNESS,BRAIN ANEURISM,NOT APPLICABLE,,RUPTURED BERRY ANEURYSM,"YES, RESULTS ARE AVAILABLE",NO,73.0,NOT APPLICABLE,,POSSESSION,PUBLIC INTOXICATION,,FILED,2005-05-01 00:00:00,JAIL,HOSPITAL/INFIRMARY,133 N. INDUSTRIAL BLVD.,DALLAS,DALLAS,DALLAS CO SHERIFFS OFFICE,75207.0,,,,,
5008,PA05142CJ,0.0,V_2005,2005-08-11 14:08:00,2005-05-01 13:56:00,FRED,GREEN,,,FRED GREEN,2005-06-29,0.0,M,BLACK,2005-07-14,2005-07-14 23:25:00,DALLAS,DALLAS,133 N. INDUSTRIAL BLVD.,,,,MEDICAL FACILITY,,DEVELOPED CONDITION AFTER ADMISSION,NATURAL CAUSES/ILLNESS,BRAIN ANEURISM,NOT APPLICABLE,,RUPTURED BERRY ANEURYSM,"YES, RESULTS ARE AVAILABLE",NO,74.0,NOT APPLICABLE,,POSSESSION,PUBLIC INTOXICATION,,FILED,2005-05-01 13:56:00,JAIL,HOSPITAL/INFIRMARY,133 N. INDUSTRIAL BLVD.,DALLAS,DALLAS,DALLAS CO SHERIFFS OFFICE,75207.0,,,,,


### [TODO] `date_of_birth` and `age_at_time_of_death` can sometimes be impossible

In [96]:
cdr.age_at_time_of_death.value_counts().sort_index().loc[:10]

-25.0     1
-24.0     1
-21.0     3
-19.0     2
-17.0     1
-13.0     1
-9.0      1
-1.0      5
 0.0     15
 1.0      1
 6.0      1
 8.0      1
Name: age_at_time_of_death, dtype: int64

In [97]:
cdr.date_of_birth.dt.year.value_counts().sort_index().loc[2015:]

2015.0     4
2016.0     1
2018.0     5
2019.0     2
2020.0     2
2021.0     5
2022.0     5
2023.0     5
2024.0    11
2025.0     6
2026.0    19
2027.0    19
2028.0    10
2029.0    18
Name: date_of_birth, dtype: int64

### [TODO] The list of possible offenses is enourmous and needs condensing
#### Of ~4600 custodial deaths, there are 2000 unique `offense_1` values, e.g.:

In [98]:
print('Of %d custodial deaths, there are %d unique offense_1 values' % (len(cdr), len(set(cdr.offense_1))))
cdr.offense_1.value_counts().head(20)

Of 8872 custodial deaths, there are 2516 unique offense_1 values


MURDER                                     412
AGGRAVATED SEXUAL ASSAULT OF A CHILD       233
AGGRAVATED ROBBERY                         169
BURGLARY OF HABITATION                     160
DRIVING WHILE INTOXICATED 3RD OR MORE      155
ROBBERY                                    154
HOMICIDE-MURDER                            153
SEXUAL ASSLT/RAPE AGG/SEX ABUSE            138
AGG SEXUAL ASSLT CHILD                     137
DRIVING WHILE INTOXICATED                  136
INDECENCY W/ A CHILD SEXUAL CONTACT        133
AGGRAVATED SEXUAL ASSAULT                  125
SEXUAL ASSLT/RAPE OF CHILD                 119
AGGRAVATED ASSAULT WITH A DEADLY WEAPON    116
INDECENCY WITH A CHILD                     113
CAPITAL MURDER                             113
AGGRAVATED ASSAULT                         106
PUBLIC INTOXICATION                         97
BURGLARY OF A HABITATION                    90
POSSESSION OF A CONTROLLED SUBSTANCE        86
Name: offense_1, dtype: int64

#### Clearly some of these could be condensed. See the possibilities for offenses containing the word 'MURDER':

In [99]:
s = cdr.offense_1.dropna()
murder = s[s.str.contains('MURDER')]
N = 20
s = murder.value_counts()[:N]
s['(...%d other types of murder)' % (len(set(murder)) - N)] = murder.value_counts()[N:].sum()
s

MURDER                                          412
HOMICIDE-MURDER                                 153
CAPITAL MURDER                                  113
ATTEMPTED CAPITAL MURDER                         72
MURDER WITH A DEADLY WEAPON                      63
MURDER INTENTIONALLY CAUSE DEATH                 24
HOMICIDE-CAPITAL MURDER                          19
MURDER WITH DEADLY WEAPON                        18
ATTEMPTED CAPITAL MURDER OF A POLICE OFFICER     17
ATTEMPTED CAPITAL MURDER OF A PEACE OFFICER      13
ATTEMPTED MURDER                                 12
CRIMINAL ATTEMPT CAPITAL MURDER                  11
HOMICIDE-ATTEMPT/ASSLT TO COMMIT MURDER           8
ATTEMPTED CAPITAL MURDER OF PEACE OFFICER         8
MURDER WITH A DEADLY WPN                          8
CAPITAL MURDER IN COMM OF CITED OFFENSES          7
ATT. CAPITAL MURDER                               6
MURDER INTENDS SBI CAUSING DEATH                  6
ATTEMPTED CAPITAL MURDER OF POLICE OFFICER        6
MURDER WITH 

#### Something like this might extract a useful simplified category (murders and attempted murders)

In [100]:
def is_attempt(x):
    parts = x.split()
    parts = [''.join(ch for ch in p if ch.isalpha()) for p in parts]
    for w in ['ATT', 'ATTEMPT', 'ATTEMPTED']:
        if w in parts:
            return True
    return False

att = murder.apply(is_attempt)
att_murder = murder[att]
comm_murder = murder[~att]
set(att_murder)

{'ATT CAPITAL MURDER',
 'ATT MURDER',
 'ATT MURDER WITH A DEADLY WPN',
 'ATT. CAPITAL MURDER',
 'ATT. CAPITAL MURDER OF PEACE OFFICER',
 'ATTEMPT CAP MURDER PO',
 'ATTEMPT CAPITAL MURDER',
 'ATTEMPT CAPITAL MURDER OF A POLICE OFFICER',
 'ATTEMPT CAPITAL MURDER-POLICE OFFICER',
 'ATTEMPT CAPITAL MURDER/ PEACE OFFICER',
 'ATTEMPT CAPITAL MURDER/POLICE OFFICER',
 'ATTEMPT CAPITOL MURDER WITH A DEADLY WEAPON',
 'ATTEMPT TO COMMIT CAPITAL MURDER',
 'ATTEMPTED CAP. MURDER OF A PEACE OFFICER',
 'ATTEMPTED CAPITAL MURDER',
 'ATTEMPTED CAPITAL MURDER (PEACE OFFICER)',
 'ATTEMPTED CAPITAL MURDER - PEACE OFFICER',
 'ATTEMPTED CAPITAL MURDER - PEACE OFFICER/FIREARM',
 'ATTEMPTED CAPITAL MURDER - POLICE OFFICER',
 'ATTEMPTED CAPITAL MURDER 5 COUNTS',
 'ATTEMPTED CAPITAL MURDER DEADLY WEAPON',
 'ATTEMPTED CAPITAL MURDER OF A CHILD',
 'ATTEMPTED CAPITAL MURDER OF A PEACE OFFICER',
 'ATTEMPTED CAPITAL MURDER OF A POLICE OFFICER',
 'ATTEMPTED CAPITAL MURDER OF A POLICE OFFICER (X 3)',
 'ATTEMPTED CAPIT

### [TODO] Manner of death is sometimes 'OTHER' when it should not be
Two columns detail the manner of death: `manner_of_death` and `manner_of_death_description`. A handful of rows have `manner_of_death` = `OTHER` when they should have picked `ACCIDENT` or `HOMICIDE` -- evidenced by the information in the `manner_of_death_description` column:

In [101]:
frame = cdr[(cdr.manner_of_death == OTHER_SPECIFY) & (cdr.manner_of_death_description.notnull())]
for exempt in ('PENDING', 'DETERMIN'):
    frame = frame[~frame.manner_of_death_description.str.contains(exempt)]
print(len(frame), len(cdr))
frame[['form_version', 'manner_of_death_description', 'manner_of_death', 'medical_cause_of_death']]

83 8872


Unnamed: 0,form_version,manner_of_death_description,manner_of_death,medical_cause_of_death
0,V_2005,DREW HIS WEAPON ON OFFICERS AND WAS SHOT BY OF...,"OTHER, SPECIFY",MULTIPLE CENTER FIRE RIFLE WOUNDS
3432,V_2005,UNDERTIMENED,"OTHER, SPECIFY",UNDERTIMINED
3438,V_2005,HIGH SPEED MOTORCYCLE CRASH WHILE ATTEMPTING T...,"OTHER, SPECIFY",MULTIPLE BLUNT FORCE INJURIES
3410,V_2005,ELECTROCUTION,"OTHER, SPECIFY",ELECTROCUTION
3461,V_2005,ACCIDENTAL,"OTHER, SPECIFY",RIGHT INTRA CEREBRAL HEMORRAHAGE DUE TO HYPERT...
3357,V_2005,CARDIAC ARREST,"OTHER, SPECIFY",METABOLIC KETOACIDOSIS DUE TO DIABETES MELLITUS
3343,V_2005,VEHICLE PURSUIT/CRASH,"OTHER, SPECIFY",PENDING AUTOPSY
3858,V_2005,MV ACCIDENT,"OTHER, SPECIFY",MULTIPLE BLUNT FORCE INJURIES
3855,V_2005,NORTRIPTYLINE INTOXICATION,"OTHER, SPECIFY",NORTRIPTYLINE INTOXICATION WITH ASPIRATION OF ...
3853,V_2005,COMPLICATIONS OF SEPSIS DUE TO PERICARDITIS DU...,"OTHER, SPECIFY",COMPLICATIONS OF SEPSIS DUE TO PERICARDITIS DU...


### [TODO] (Related) Extract motor vehicle accidents into a unique type

In [102]:
# This starter code should get it going
frame = cdr[(cdr.manner_of_death != 'ACCIDENTAL')]
frame = frame[(frame.medical_cause_of_death.fillna('').str.contains('ACCIDENT')) |
              frame.manner_of_death_description.fillna('').str.contains('ACCIDENT')]
print(len(frame))
frame[['medical_cause_of_death', 'means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description']]

43


Unnamed: 0,medical_cause_of_death,means_of_death,means_of_death_other,manner_of_death,manner_of_death_description
3450,COMBINED ETHANOL AND COCAINE TOXICITY - ACCIDE...,NOT APPLICABLE,,ALCOHOL/DRUG INTOXICATION,
3461,RIGHT INTRA CEREBRAL HEMORRAHAGE DUE TO HYPERT...,NOT APPLICABLE,,"OTHER, SPECIFY",ACCIDENTAL
3366,CEREBRAL VASCULAR ACCIDENT WITH SECONDARY HEMO...,NOT APPLICABLE,,NATURAL CAUSES/ILLNESS,CEREBRAL VASCULAR ACCIDENT WITH SECONDARY HEMO...
3858,MULTIPLE BLUNT FORCE INJURIES,"OTHER, SPECIFY",MV ACCIDENT,"OTHER, SPECIFY",MV ACCIDENT
3855,NORTRIPTYLINE INTOXICATION WITH ASPIRATION OF ...,NOT APPLICABLE,,"OTHER, SPECIFY",NORTRIPTYLINE INTOXICATION
3693,ANOXIC ENCEPHALOPATHY AND MULTIPLE ORGAN FAILU...,"OTHER, SPECIFY",UNKNOWN,"OTHER, SPECIFY",ACCIDENT
3658,"MYCARDIAL INFARCTION INCLUDING HYPERTENSION, H...",NOT APPLICABLE,,NATURAL CAUSES/ILLNESS,CARDIAC ARREST
2718,ACUTE CEREBROVASCULAR ACCIDENT (STROKE),NOT APPLICABLE,,NATURAL CAUSES/ILLNESS,STROKE
2950,ACCIDENTAL OVERDOSE,"OTHER, SPECIFY",UNKNOWN,ALCOHOL/DRUG INTOXICATION,
5069,MASSIVE LEFT CEREBRAL VASCULAR ACCIDENT (STROKE),NOT APPLICABLE,,NATURAL CAUSES/ILLNESS,STROKE


### [TODO] Fix up `medical_treatment`

In [103]:
pd.crosstab(cdr.medical_treatment, cdr.form_version).sort_values(['V_2005'],ascending=False)

form_version,V_2005,V_2016,V_BJS
medical_treatment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
NOT APPLICABLE,2871,226,29
YES,1435,241,2334
NO,804,175,7
UNKNOWN,0,535,60


### [TODO] Remove `medical_examinor_coroner_evalution`? Not sure it's useful.

In [104]:
pd.crosstab(cdr.medical_examinor_coroner_evalution, cdr.form_version).sort_values(['V_2005'],ascending=False)

form_version,V_2005,V_2016,V_BJS
medical_examinor_coroner_evalution,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"YES, RESULTS ARE AVAILABLE",4216,341,40
"NO, EVALUATION NOT PLANNED",702,370,9
"YES, RESULTS PENDING",158,466,1
"NO, EVALUATION PENDING",35,0,2
DON'T KNOW,0,0,5
NO,0,0,1679
YES,0,0,694


### [TODO] Somehow blend the `entry_behavior`/`other_behavior` field from V_2005 with the `exhibit_any_medical_problems`, `exhibit_any_mental_health_problems` and `make_suicidal_statements` from V_2016

In [105]:
pd.crosstab(cdr.entry_behavior, cdr.form_version).sort_values(['V_2005'],ascending=False)

form_version,V_2005,V_BJS
entry_behavior,Unnamed: 1_level_1,Unnamed: 2_level_1
MEDICAL,349,1
INTOXICATED,192,1
MENTAL,82,1
MENTAL; MEDICAL,41,0
INTOXICATED; MEDICAL,29,0
INTOXICATED; MENTAL,18,0
INTOXICATED; MENTAL; MEDICAL,9,0
MEDICAL; NA,4,0
INTOXICATED; NA,1,0


In [106]:
pd.crosstab(cdr.exhibit_any_medical_problems, cdr.form_version).sort_values(['V_2016'],ascending=False)

form_version,V_2016
exhibit_any_medical_problems,Unnamed: 1_level_1
UNKNOWN,463
NO,311
YES,268


In [107]:
pd.crosstab(cdr.exhibit_any_mental_health_problems, cdr.form_version).sort_values(['V_2016'],ascending=False)

form_version,V_2016
exhibit_any_mental_health_problems,Unnamed: 1_level_1
UNKNOWN,590
NO,346
YES,100


In [108]:
pd.crosstab(cdr.make_suicidal_statements, cdr.form_version).sort_values(['V_2016'],ascending=False)

form_version,V_2016
make_suicidal_statements,Unnamed: 1_level_1
UNKNOWN,539
NO,418
YES,79


### [TODO] Align various binary behavioral columns from each form version, add to final output

In [109]:
# These columns were completely removed. Some can stay once paired off
drop_cols

{'Agency Phone',
 'Agency State',
 'Appear intoxicated (alcohol or drugs)',
 "Attempt gain possession officer's weapon",
 'Attempt to Injure Others?',
 'Barricade self or initiate standoff?',
 'Custody Type Facility',
 'Death Reason',
 'Death Weapon Other Description',
 'Decedent Display or Use Weapon Details',
 'Decedent display/use of weapons',
 'Director First Name',
 'Director Last Name',
 'Director Middle Name',
 'Director Salutation',
 'Escape or attempt to escape/flee custody',
 "Gain possession of officer's weapon",
 'Grab, hit or fight with the officer(s)',
 'Injured By',
 'Location Category',
 'Medical Treatment Description',
 'Other Agencies Respond?',
 'Other Location Category',
 'Other device, specify',
 'Other weapon, specify',
 'Physically attempt/assault officer(s)',
 'Reporter Email',
 'Reporter Name',
 'Reporter Name Original CDR',
 'Resist being handcuffed or arrested?',
 'Specify Weapon Used',
 'Status',
 'TDCJ - Specify Unit',
 'Threaten the officer(s) involved',
 