# Clean and reformat CDR data from its multi-tab Excel file into a single CSV

### About the data

CDR data is tricky -- the form used by law enforcement has changed over time, first in 2005, then again in 2016. The data before 2005 is known to be sparse and poorly enforced, so we ignore those entries. The 2005 and 2016 versions of the form have some overlap and some differences, so we must be careful in how we merge them.

**In this repo you can find blank versions of the [2005](https://github.com/texas-justice-initiative/data-processing/blob/master/forms/CDR%20Form%20Version%202005.pdf) and [2016](https://github.com/texas-justice-initiative/data-processing/blob/master/forms/CDR%20Form%20Version%202016.pdf) forms, to see for yourself exactly what fields are collected and how.**

### Datasets used


* Input:
  * `tji/deaths-in-custody/CDR - All Reports.xlsx`
  * `tji/auxiliary-datasets/agencies_and_counties`
* Output:
  * `tji/deaths-in-custody/cleaned_custodial_death_reports.csv`
  
##### Author: Everett Wetchler (everett.wetchler@gmail.com)

## Setup and read data

In [1]:
DTW_PROJECT_KEY_CDR = 'tji/deaths-in-custody'
RAW_FILENAME = 'original/CDR Reports All.xlsx'
SHEETNAMES = ['Form Version 2005', 'Form Version 2016', 'Older Forms']
CLEANED_FILENAME = 'cleaned_custodial_death_reports.csv'

In [2]:
import datetime

import datadotworld as dw
import json
import numpy as np
import pandas as pd

from lib.cleaning_tools import *

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

%load_ext watermark
%watermark -a "Everett Wetchler" -d -t -z -w -p numpy,pandas,datadotworld

Everett Wetchler 2018-05-28 23:07:03 CDT

numpy 1.14.3
pandas 0.22.0
datadotworld 1.6.0
watermark 1.6.0


In [3]:
from lib.standardize_police_agency_names import standardize_agency_name

In [4]:
datasets = dw.load_dataset('tji/auxiliary-datasets', force_update=True)
agency_county = datasets.dataframes['agencies_and_counties']
agency_county = agency_county.set_index('agency')['county'].to_dict()
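
The lookup built above is a plain dict keyed by agency name. A minimal sketch with made-up rows (the agency and county names below are illustrative, not taken from the real auxiliary dataset) shows the shape of the result, plus one thing worth checking: if an agency appears more than once, `to_dict()` keeps only the last county it sees.

```python
import pandas as pd

# Toy stand-in for the 'agencies_and_counties' table (all rows are made up).
toy = pd.DataFrame({
    'agency': ['EXAMPLE POLICE DEPT', 'SAMPLE SHERIFFS OFFICE', 'EXAMPLE POLICE DEPT'],
    'county': ['ALPHA', 'BETA', 'GAMMA'],
})

# Same pattern as the cell above: index by agency, take the county column, convert to dict.
lookup = toy.set_index('agency')['county'].to_dict()

# Agencies listed more than once; for these, only the last county survives in the dict.
repeated = toy.loc[toy.agency.duplicated(keep=False), 'agency'].unique().tolist()
```

If `repeated` is non-empty on the real table, the mapping is ambiguous for those agencies and may deserve a manual look.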

In [5]:
dfs = dict(zip(SHEETNAMES, read_dtw_excel(DTW_PROJECT_KEY_CDR, RAW_FILENAME, sheet_names=SHEETNAMES)))
dfs['Form Version 2005']['form_version'] = 'V_2005'
dfs['Form Version 2016']['form_version'] = 'V_2016'
dfs['Older Forms']['form_version'] = 'V_OLDER'
cdr = pd.concat([dfs['Form Version 2005'], dfs['Form Version 2016']])
# Sum over the DataFrames (dfs.values()); iterating the dict itself yields the sheet-name strings.
print('Read %d raw CDR records. Ignoring %d from older form versions (pre-2005).' % (
    sum(len(x) for x in dfs.values()), len(dfs['Older Forms'])))
print('Keeping %d records using form version 2005, and %d using version 2016 (keeping %d in total)' % (
    len(dfs['Form Version 2005']), len(dfs['Form Version 2016']), len(cdr)))

Writing excel file to temp file: /var/folders/dc/8cbxbsh515s908xl0zyprszm0000gn/T/tmprxsppj0v
Read 11739 raw CDR records. Ignoring 5468 from older form versions (pre-2005).
Keeping 5156 records using form version 2005, and 1115 using version 2016 (keeping 6271 in total)


### In theory, all these records should be for deaths in 2005 or later. Let's double check and drop any miscreants.

In [6]:
assert cdr['Death Date and Time'].isnull().sum() == 0

In [7]:
before = len(cdr)
cdr = cdr[cdr['Death Date and Time'].dt.year >= 2005]
after = len(cdr)
print('Dropped %d (of %d) reports for deaths before 2005, leaving %d' % (before - after, before, after))

Dropped 6 (of 6271) reports for deaths before 2005, leaving 6265
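
The null check before the filter matters because a year comparison quietly drops rows with a missing date. A small sketch on toy data (the column name `death` is made up for illustration) shows the behavior:

```python
import pandas as pd

# Three toy rows: one before 2005, one in 2005, one with a missing date.
toy = pd.DataFrame({'death': pd.to_datetime(['2004-12-31', '2005-01-01', None])})

# .dt.year is NaN for a missing date, and NaN >= 2005 evaluates to False,
# so the missing-date row is filtered out along with the pre-2005 row.
kept = toy[toy['death'].dt.year >= 2005]
```

Asserting that no dates are null first, as the notebook does, makes sure the "dropped" count only reflects genuinely early deaths.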


### A quick look at the raw data

In [8]:
dfs['Form Version 2005'].head()

Unnamed: 0,CDR: CDR Name,Report Date,Status,Version Type,Version Number,Department Type,Agency Name,Agency Phone,Agency Address,Agency City,Agency County,Agency State,Agency Zip,Director Salutation,Director First Name,Director Middle Name,Director Last Name,Reporter Name Original CDR,Reporter Email,Street Address,City,County,Longitude,Latitude,Census Tract,Type of Custody,Specific Type of Custody/Facility,Custody Type Facility,Entry Date Time,Entry Date Time N/A,Death Location,Death Location Elsewhere,First Name,Middle Name,Last Name,Suffix,Date of Birth,Sex,Ethnicity,Ethnicity Other,Death Date and Time,Age At Time Of Death,Date/Time of Custody or Incident,Custody Date NA,Medical Examinor/Coroner Evalution?,Manner of Death,Manner of Death Description,Death Reason,Medical Cause of Death,Medical Treatment,Medical Treatment Description,Who caused the death?,Death Causer Other,Type of Death Weapon,Death Weapon Other Description,Pre existing medical condition?,Means of Death,Means of Death Other,Offense 1,Offense 2,Offense 3,Were the Charges:,Type of Offense,"Type of Offense, Other",Injured By,Threaten the officer(s) involved,Resist being handcuffed or arrested?,Try to escape/flee from custody,"Grab, hit or fight with the officer(s)",Other Behavior,Specify Other Behavior,Appear intoxicated (alcohol or drugs),Use weapon threaten/assault officer(s),Entry Behavior,Under Restraint,Type of Restraint,"Other device, specify",form_version
0,PA05001C,2005-03-02 14:48:00,Submitted,ORIGINAL VERSION,1,POLICE,Fort Worth Police Dept.,8178778022,350 W. Belknap,Fort Worth,TARRANT,TX,76102,Chief,Ralph,,Mendoza,Renee Gray,renee.gray@oag.state.tx.us,1509 W. Hammond,Fort Worth,Tarrant,,,,Police Custody (pre-booking),Custody of Peace Officer during/fleeing arrest,,NaT,1,At the crime/arrest scene,,Dino,,Gomez,,1964-04-03,Male,Hispanic,,2005-01-01 15:04:00,40,2005-01-01 15:04:00,0,"Yes, results are available",Justifiable Homicide,,Injuries only,Gunshot Wound to the Chest,Not Applicable,,Law enforcement/correctional staff,,Handgun,,Not Applicable; cause of death was accidental ...,Firearm,,Aggravated Assault,,,Not filed at time of death,,,Injured by Officer,Yes,Yes,Yes,No,0,,No,1,,No,,,V_2005
1,PA05002CJ,2005-03-03 14:07:00,Submitted,ORIGINAL VERSION,1,SHERIFF,Travis County Sheriff's Dept.,5128549770,P. O. Box 1748,Austin,TRAVIS,TX,78767,Sheriff,Margo,L.,Frasier,Renee Gray,renee.gray@oag.state.tx.us,3614 Bill Price Road,Del Valle,Travis,,,,County Jail,Jail - multiple occupancy cell,,2004-10-20 20:56:00,0,At medical facility,,Michael,Darnell,Dickson,,1953-03-04,Male,African-American,,2005-01-01 15:59:00,51,2004-10-20 20:56:00,0,"Yes, results are available",Natural Causes/Illness,Cardiac Arrest,Medical condition only (e.g. heart attack),Myocardial infarction,Yes,"Treatment for hypertension, administered Proca...","Not applicable; cause of death was suicide, in...",,Not Applicable,,Pre-existing medical condition,Not applicable; cause of death was intoxicatio...,,Man/Del/Sell/Poss Controlled Substance,,,Filed,,,Injured by NA,No,No,No,No,0,,No,0,Medical,No,,,V_2005
2,PA05003P,2005-03-03 14:16:00,Submitted,ORIGINAL VERSION,1,STAGENCY,Texas Department Of Criminal Justice,9364376716,P.O. Box 4003,Huntsville,TRAVIS,TX,773424003,Mr.,Chris,H.,Stallings,Renee Gray,renee.gray@oag.state.tx.us,21 FM 247,Huntsville,Walker,,,,Penitentiary,TDCJ,Byrd,2004-12-17 09:00:00,0,At medical facility,,Darryl,Glenn,Wallace,,1960-12-03,Male,African-American,,2005-01-02 04:35:00,44,2004-12-17 09:00:00,0,"Yes, results are available",Natural Causes/Illness,Cardiac,Medical condition only (e.g. heart attack),Cardiac death secondary to cardiac ischemia th...,Yes,Nitroglycerin,"Not applicable; cause of death was suicide, in...",,Not Applicable,,Pre-existing medical condition,Not applicable; cause of death was intoxicatio...,,Burglary of Habitation,,,Convicted,,,Injured by NA,No,No,No,No,0,,No,0,,No,,,V_2005
3,PA05004C,2005-03-03 14:40:00,Submitted,ORIGINAL VERSION,1,POLICE,San Antonio Police Dept.,2102077449,"214 W. Nueva, Suite 331",San Antonio,BEXAR,TX,78207,Chief,Albert,,Ortiz,Renee Gray,renee.gray@oag.state.tx.us,2102 Goliad Road,San Antonio,Bexar,,,,Police Custody (pre-booking),Custody of Peace Officer during/fleeing arrest,,NaT,1,At the crime/arrest scene,,Albert,Chavez,Enriquez,,1983-11-28,Male,Hispanic,,2005-01-03 15:00:00,21,2005-01-03 14:50:00,0,"Yes, results are available",Justifiable Homicide,,Injuries only,Multiple Gunshot Wounds,Not Applicable,,Law enforcement/correctional staff,,Handgun,,Not Applicable; cause of death was accidental ...,Firearm,,Aggravated Robbery,,,Not filed at time of death,,,Injured by Officer,Yes,Yes,Yes,Yes,0,,No,1,,No,,,V_2005
4,PA05005C,2005-03-03 14:51:00,Submitted,ORIGINAL VERSION,1,POLICE,Houston Police Dept.,7133081778,"1200 Travis, 17th Floor",Houston,HARRIS,TX,77002,Chief,Clarence,O.,Bradford,Renee Gray,renee.gray@oag.state.tx.us,3600 block of Telephone Rd,Houston,Harris,,,,Police Custody (pre-booking),Custody of Peace Officer during/fleeing arrest,,NaT,1,At the crime/arrest scene,,Alex,,Mendez,,1978-12-30,Male,Hispanic,,2005-01-03 17:15:00,26,2005-01-03 17:15:00,0,"Yes, results are available",Justifiable Homicide,,Injuries only,Multiple Gunshot Wounds,Not Applicable,,Law enforcement/correctional staff,,Handgun,,Not Applicable; cause of death was accidental ...,Firearm,,Traffic Violation,,,Not filed at time of death,,Traffic Violation,Injured by Officer,Yes,No,No,No,0,,No,1,,No,,,V_2005


In [9]:
dfs['Form Version 2016'].head()

Unnamed: 0,CDR: CDR Name,Version Type,Version Number,Report Date,Status,Agency Name,Agency Address,Agency City,Agency State,Agency Zip,Director Salutation,Director First Name,Director Middle Name,Director Last Name,Reporter Name,Reporter Email,First Name,Middle Name,Last Name,Suffix,Date of Birth,Sex,Race,Age At Time Of Death,Date/Time of Custody or Incident,Death Date and Time,Medical Examinor/Coroner Evalution?,Manner of Death,Manner of Death Description,Medical Cause of Death,Medical Treatment,Who caused the death?,Type of weapon that caused death?,"Other weapon, specify",Pre existing medical condition?,Means of Death,Means of Death Other,Street Address,City,County,Zip,Longitude,Latitude,Census Tract,Location Category,Other Location Category,Type of Custody,Specific Type of Custody/Facility,TDCJ - Specify Unit,Entry Date Time,Entry Date Time N/A,Death Location,Death Location Elsewhere,Other Agencies Respond?,Offense 1,Offense 2,Offense 3,Were the Charges:,Type of Offense,"Type of Offense, Other",Decedent display/use of weapons,Decedent Display or Use Weapon Details,Specify Weapon Used,Attempt to Injure Others?,Appear intoxicated (alcohol or drugs),Make suicidal statements?,Exhibit any mental health problems?,Exhibit any medical problems?,Barricade self or initiate standoff?,Resist being handcuffed or arrested?,Physically attempt/assault officer(s),Gain possession of officer's weapon,Verbally threaten other(s) including law,Escape or attempt to escape/flee custody,Attempt gain possession officer's weapon,Under Restraint,Type of Restraint,"Other device, specify",form_version
0,16-3-C,AMENDED,4,2016-12-12 13:02:00,Submitted,Texas Department Of Public Safety,PO BOX 4087,Austin,TX,78773,Director,Steven,,McCraw,Joanne Scarbrough,joanne.scarbrough@dps.texas.gov,Ivory,Charles,Pantallion,III,1980-09-14,Male,Black or African American,36,2016-11-22 09:26:00,2016-11-22 09:26:00,"Yes, results are available",Homicide (includes Justifiable Homicide),,Multiple Gunshot Wounds,Not Applicable,Law enforcement/correctional personnel,Handgun; Rifle/shotgun,,Not Applicable; cause of death was accidental ...,Firearm,,7300 Interstate 10 W,Baytown,Harris,77521.0,,,,Roadway/highway/street/sidewalk,,Police Custody (pre-booking),Custody of Law Enforcement Personnel during/fl...,,NaT,1,Scene of incident,,Yes,Aggravated Assault on Peace Officer,Evading Arrest or Detention,,Not filed at time of death,Violent Crime Against Persons,,"Yes, mark all that apply",Displayed firearm without discharge,,Yes (select all that apply),No,No,No,No,Unknown,Yes,Yes,No,Unknown,Yes,No,No,,,V_2016
1,14-1-C,AMENDED,2,2016-12-13 16:20:00,Submitted,Texas Department Of Public Safety,PO BOX 4087,Austin,TX,78773,Director,Steven,,McCraw,Joanne Scarbrough,joanne.scarbrough@dps.texas.gov,James,Earl,Nicholas,,1966-04-24,Male,Anglo or White,48,2014-09-11 17:00:00,2014-09-11 17:00:00,"Yes, results are available",Could not be determined,,Multiple Gunshot Wounds,Not Applicable,Law enforcement/correctional personnel,"Firearm, unspecified",,Not Applicable; cause of death was accidental ...,Firearm,,2030 Jacintoport Blvd.,Houston,Harris,77015.0,,,,Roadway/highway/street/sidewalk,,Police Custody (pre-booking),Custody of Law Enforcement Personnel during/fl...,,NaT,1,Scene of incident,,Yes,Capital Murder - Filed,Aggravated Assault on a Peace Officer - Not Filed,Evading Arrest or Detention - Not Filed,Filed,Violent Crime Against Persons,,"Yes, mark all that apply",Discharged firearm,,Yes (select all that apply),No,No,No,No,No,Yes,Yes,No,Unknown,Yes,No,No,,,V_2016
2,16-4-P,ORIGINAL VERSION,1,2016-12-14 15:27:00,Submitted,TDCJ/Office of the Inspector General,"2503 Lake Road, Suite 5",Huntsville,TX,77340,Other,John,,West,Analou Sievers,analou.sievers@tdcj.texas.gov,Percy,,Froman,,1969-12-19,Male,Anglo or White,46,2000-01-28 00:00:00,2016-11-17 08:25:00,"No, evaluation not planned",Natural,,Colon Cancer,No,Not applicable,Not Applicable,,Pre-existing medical condition,"Not applicable, cause of death was illness/nat...",,8602 Peach Street,Lubbock,Lubbock,79404.0,,,,Law Enforcement Facility,,Penitentiary,"TDCJ, specify",Montford,2000-01-28 00:00:00,0,Medical facility,,No,Aggravated Robbery,,,Convicted,Violent Crime Against Persons,,No,,,No,,,,,,,,,,,,No,,,V_2016
3,16-5-C,ORIGINAL VERSION,1,2016-12-14 18:37:00,Submitted,San Antonio Police Dept.,315 S. Santa Rosa,San Antonio,TX,78207,Chief,William,,McManus,Leroy Carrion,leroy.carrion@sanantonio.gov,Andrew,,Moreno,,1991-12-23,Male,Hispanic or Latino,24,2016-11-19 21:00:00,2016-11-20 05:34:00,"Yes, results are available",Homicide (includes Justifiable Homicide),,Multiple gunshot wounds,No,Law enforcement/correctional personnel,Rifle/shotgun,,Not Applicable; cause of death was accidental ...,Firearm,,5814 Shadow Glen #4,San Antonio,Bexar,78238.0,,,,Residence/Home,,Police Custody (pre-booking),Custody of Law Enforcement Personnel during/fl...,,2016-11-19 21:00:00,0,Scene of incident,,Yes,Aggravated Assault with a Deadly Weapon,,,Filed,Violent Crime Against Persons,,"Yes, mark all that apply",Displayed firearm without discharge,,Yes (select all that apply),Unknown,Yes,No,No,Yes,No,Yes,No,Yes,No,No,No,,,V_2016
4,16-6-MJ,ORIGINAL VERSION,1,2016-12-15 11:20:00,Submitted,Rosenberg Police Dept.,2120 Fourth St.,Rosenberg,TX,77471,Chief,Dallis,,Warren,Charles Crocker,justin.crocker@rosenbergtx.gov,Roberto,Eduardo,Velasquez,,1977-01-31,Male,Hispanic or Latino,39,2016-12-02 22:32:00,2016-12-03 09:15:00,"Yes, results pending",Pending autopsy results,,Pending autopsy results.,No,Not applicable,Not Applicable,,Pre-existing medical condition,"Not applicable, cause of death was illness/nat...",,1910 Louise #41,Rosenberg,Fort Bend,77471.0,,,,Residence/Home,,Municipal Jail,Jail - detox cell,,2016-12-02 22:32:00,0,Medical facility,,No,TCIC warrant - Failure to Appear (Child Neglect),,,Filed,"Other, specify",Investigation of disturbance call at residence...,No,,,No,Yes,No,No,No,No,No,No,No,No,No,No,No,,,V_2016


## Helper methods

In [10]:
def show_notnull_chart(cdr, vertical=False):
    '''Show which columns have null values, how often, and break down by form version.'''
    versions = sorted(set(cdr.form_version))
    frames = []
    for v in versions:
        fr = cdr[cdr.form_version == v]
        s = fr.notnull().mean()
        frames.append(s)
    frame = pd.concat(frames, axis=1)
    frame.columns = versions
    frame.sort_index(inplace=True)
    if not vertical:
        frame = frame.T
    return frame.style.background_gradient(cmap='RdYlGn', axis=(0 if vertical else 1))
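
For readers who only want the numbers behind the chart, the per-version fill rates can likely be computed in one expression. This is a sketch on toy data, not a replacement for the helper above: it skips the color styling and leaves out the `form_version` column itself.

```python
import pandas as pd

# Toy frame: 'ethnicity' is only filled on the 2005 form, 'race' only on the 2016 form.
toy = pd.DataFrame({
    'form_version': ['V_2005', 'V_2005', 'V_2016', 'V_2016'],
    'ethnicity': ['ANGLO', 'HISPANIC', None, None],
    'race': [None, None, 'ANGLO OR WHITE', None],
})

# Fraction of non-null cells per column, grouped by form version.
frac = toy.drop(columns='form_version').notnull().groupby(toy['form_version']).mean()
```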

In [11]:
OTHER_SPECIFY = 'OTHER, SPECIFY'

-----
## Begin Cleaning
-----

### First, take a look at how often our columns are populated, broken down by form version (since each asks slightly different questions)

In [12]:
show_notnull_chart(cdr)

Unnamed: 0,Age At Time Of Death,Agency Address,Agency City,Agency County,Agency Name,Agency Phone,Agency State,Agency Zip,Appear intoxicated (alcohol or drugs),Attempt gain possession officer's weapon,Attempt to Injure Others?,Barricade self or initiate standoff?,CDR: CDR Name,Census Tract,City,County,Custody Date NA,Custody Type Facility,Date of Birth,Date/Time of Custody or Incident,Death Causer Other,Death Date and Time,Death Location,Death Location Elsewhere,Death Reason,Death Weapon Other Description,Decedent Display or Use Weapon Details,Decedent display/use of weapons,Department Type,Director First Name,Director Last Name,Director Middle Name,Director Salutation,Entry Behavior,Entry Date Time,Entry Date Time N/A,Escape or attempt to escape/flee custody,Ethnicity,Ethnicity Other,Exhibit any medical problems?,Exhibit any mental health problems?,First Name,Gain possession of officer's weapon,"Grab, hit or fight with the officer(s)",Injured By,Last Name,Latitude,Location Category,Longitude,Make suicidal statements?,Manner of Death,Manner of Death Description,Means of Death,Means of Death Other,Medical Cause of Death,Medical Examinor/Coroner Evalution?,Medical Treatment,Medical Treatment Description,Middle Name,Offense 1,Offense 2,Offense 3,Other Agencies Respond?,Other Behavior,Other Location Category,"Other device, specify","Other weapon, specify",Physically attempt/assault officer(s),Pre existing medical condition?,Race,Report Date,Reporter Email,Reporter Name,Reporter Name Original CDR,Resist being handcuffed or arrested?,Sex,Specific Type of Custody/Facility,Specify Other Behavior,Specify Weapon Used,Status,Street Address,Suffix,TDCJ - Specify Unit,Threaten the officer(s) involved,Try to escape/flee from custody,Type of Custody,Type of Death Weapon,Type of Offense,"Type of Offense, Other",Type of Restraint,Type of weapon that caused death?,Under Restraint,Use weapon threaten/assault officer(s),Verbally threaten other(s) including law,Version Number,Version Type,Were the Charges:,Who caused the death?,Zip,form_version
V_2005,1,1,1,0.999612,1,1,0.993398,1,0.998447,0.0,0,0.0,1,0,0.998835,1,1,0.512427,1,1,0.00912621,1,1,0.0192233,0.999806,0.0333981,0.0,0,0.999806,1,1,0.202524,1,0.141748,0.693592,1,0.0,1,0.00427184,0.0,0.0,1,0.0,0.998641,0.999806,1,0,0,0,0.0,1,0.589126,1,0.0460194,0.999417,1,0.999806,0.279223,0.559806,0.998835,0.261553,0.0819417,0,1,0.0,0.0159223,0.0,0.0,1,0,1.0,1.0,0,0.999806,0.998641,1,1.0,0.0341748,0.0,1,0.999806,0.0706796,0.0,0.999417,0.999029,1,0.999417,0.00194175,0.149126,0.00038835,0.0,0.999806,1,0.0,1,1,1,1,0.0,1
V_2016,1,1,1,0.0,1,0,1.0,1,0.875336,0.64843,1,0.644843,1,0,0.999103,1,0,0.0,1,1,0.0,1,1,0.0197309,0.0,0.0,0.164126,1,0.0,1,1,0.0125561,1,0.0,0.773991,1,0.643946,0,0.0,0.880717,0.875336,1,0.647534,0.0,0.0,1,0,1,0,0.874439,1,0.0107623,1,0.0286996,0.996413,1,1.0,0.0,0.373094,0.994619,0.266368,0.104036,1,0,0.0403587,0.00717489,0.032287,0.649327,1,1,0.999103,0.999103,1,0.0,0.636771,1,0.955157,0.0,0.0170404,1,1.0,0.0466368,0.66278,0.0,0.0,1,0.0,0.999103,0.119283,0.0547085,0.999103,1.0,0,0.650224,1,1,1,1,0.999103,1


### Start by dropping columns that we either don't need or haven't written code to clean yet.

In [13]:
keep_text = '''Both forms

- Age At Time Of Death
- Agency Address
- Agency City
- Agency Name
- Agency Zip
- CDR: CDR Name
- Census Tract
- City
- County
- Date of Birth
- Date/Time of Custody or Incident
- Death Date and Time
- Death Location
- Death Location Elsewhere
- Entry Date Time
- Entry Date Time N/A
- First Name
- Middle Name
- Last Name
- Latitude
- Longitude
- Suffix
- Manner of Death
- Manner of Death Description
- Means of Death
- Means of Death Other
- Medical Cause of Death
- Medical Examinor/Coroner Evalution?
- Medical Treatment
- Offense 1
- Offense 2
- Offense 3
- Pre existing medical condition?
- Report Date
- Sex
- Specific Type of Custody/Facility
- Street Address
- Type of Custody
- Type of Offense
- Type of Offense, Other
- Version Number
- Version Type
- Were the Charges:
- Who caused the death?
- form_version

2005 form only

- Agency County
- Custody Date NA
- Death Causer Other
- Department Type
- Entry Behavior
- Ethnicity
- Ethnicity Other
- Other Behavior
- Specify Other Behavior

2016 form only

- Exhibit any medical problems?
- Exhibit any mental health problems?
- Make suicidal statements?
- Race'''
keep_cols = []
drop_cols = set(cdr.columns)
for line in keep_text.splitlines():
    if line.startswith('- '):
        colname = line[2:]
        keep_cols.append(colname)
        drop_cols.remove(colname)
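
The loop above doubles as a sanity check: `drop_cols.remove(colname)` raises a `KeyError` if a listed column is misspelled or missing from the data. A sketch of the same parsing on a tiny made-up list, collecting unknown names instead of failing on the first one (the `available` set is a stand-in for `cdr.columns`):

```python
# Shortened, made-up version of the keep list; 'Ethnicty' is deliberately misspelled.
keep_text = '''Both forms

- City
- County

2005 form only

- Ethnicty'''

# Stand-in for set(cdr.columns).
available = {'City', 'County', 'Ethnicity', 'Agency Phone'}

# Lines starting with '- ' are column names; section titles and blanks are skipped.
keep_cols = [line[2:] for line in keep_text.splitlines() if line.startswith('- ')]

# Names in the keep list that do not exist in the data (typos, renamed columns).
unknown = [c for c in keep_cols if c not in available]

# Everything not explicitly kept.
drop_cols = available - set(keep_cols)
```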

### Rename columns to be more machine friendly (lowercase, snake_case, and remove non-alphanumeric characters)

In [14]:
col_renames = {}
for c in keep_cols:
    new_name = ''.join([ch if ch.isalnum() else ' ' for ch in c.lower()])
    new_name = '_'.join(new_name.strip().split())
    col_renames[c] = new_name

In [15]:
cdr = cdr[list(col_renames.keys())]
cdr.rename(col_renames, inplace=True, axis=1)
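
To make the renaming rule concrete, here is the same transformation pulled into a small function (the name `to_snake_case` is ours, not part of the notebook) and applied to a few of the original column names. It also checks that no two columns collapse to the same new name, which this rule could in principle cause.

```python
def to_snake_case(name):
    """Lowercase, turn every non-alphanumeric character into a space, join words with '_'."""
    cleaned = ''.join(ch if ch.isalnum() else ' ' for ch in name.lower())
    return '_'.join(cleaned.split())

columns = [
    'CDR: CDR Name',
    'Entry Date Time N/A',
    'Medical Examinor/Coroner Evalution?',
    'Type of Offense, Other',
    'Were the Charges:',
]
renames = {c: to_snake_case(c) for c in columns}

# True if two different original names map to the same machine-friendly name.
has_collision = len(set(renames.values())) != len(renames)
```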

## Add rows from the old master file to get BJS inpatient deaths (will dedup after)

In [16]:
datasets = dw.load_dataset('tji/raw-and-processing', force_update=True)
old_master = datasets.dataframes['reformatted_cdr_2017_master_file']
old_master['form_version'] = 'V_BJS'
print(old_master.shape)
old_master.head()

(7735, 47)


Unnamed: 0,first_name,middle_name,last_name,suffix,age_at_time_of_death,sex,date_of_birth,offense_1,offense_2,offense_3,report_date,death_date_and_time,cdr_cdr_name,agency_name,agency_address,agency_city,agency_zip,agency_county,who_caused_the_death,death_causer_other,date_time_of_custody_or_incident,custody_date_na,type_of_custody,specific_type_of_custody_facility,street_address,city,county,death_location,death_location_elsewhere,means_of_death,means_of_death_other,manner_of_death,manner_of_death_description,entry_date_time,entry_date_time_n_a,medical_cause_of_death,pre_existing_medical_condition,medical_examinor_coroner_evalution,medical_treatment,ethnicity,ethnicity_other,other_behavior,entry_behavior,were_the_charges,type_of_offense,type_of_offense_other,form_version
0,Clyde,L,Pady,,80.0,Male,2024-10-01,SEXUAL ASSLT/RAPE OF CHILD,SEXUAL ASSLT/RAPE OF CHILD,SEXUAL ASSLT/RAPE OF CHILD,3/1/2005 0:00,2005-01-02,PRISON20051001,Texas Department Of Criminal Justice,,,,,,,5/30/1995 0:00,,Penitentiary,TDCJ,,Tenessee Colony,,,,,,Natural Causes/Illness,Prostate/Lung Cancer,,,,Developed condition after admission,No,Yes,Anglo or White,,,,Convicted,child,,V_BJS
1,Winston,,Wilkins,,61.0,Male,1943-09-27,"STOLEN VEHICLE, UNAUTHORIZED USE",0,0,3/1/2005 0:00,2005-01-02,PRISON20051002,Texas Department Of Criminal Justice,,,,,,,9/8/2003 0:00,,Penitentiary,TDCJ,,Galveston,,,,,,Natural Causes/Illness,Lung Cancer,,,,Pre-existing medical condition,No,Yes,Anglo or White,,,,Convicted,property,,V_BJS
2,Steven,,Fleming,,37.0,Male,1967-10-30,SEXUAL ASSLT/RAPE AGG/SEX ABUSE,0,0,3/1/2005 0:00,2005-01-04,PRISON20051004,Texas Department Of Criminal Justice,,,,,,,10/7/1988 0:00,,Penitentiary,TDCJ,,Livingston,,,,,,Natural Causes/Illness,,,,,Pre-existing medical condition,No,Yes,Black or African American,,,,Convicted,violent,,V_BJS
3,James,B,Walker,,61.0,Male,1943-08-27,AGG SEXUAL ASSLT,AGG KIDNAPPING,0,3/1/2005 0:00,2005-01-04,PRISON20051005,Texas Department Of Criminal Justice,,,,,,,4/23/2004 0:00,,Penitentiary,TDCJ,,Tenessee Colony,,,,,,Natural Causes/Illness,Hepatocellular Cancer,,,,Developed condition after admission,No,Yes,Black or African American,,,,Convicted,violent,,V_BJS
4,David,L,Jones,,45.0,Male,1960-05-01,INDECENCY W/ A CHILD SEXUAL CONTACT,0,0,3/1/2005 0:00,2005-01-06,PRISON20051007,Texas Department Of Criminal Justice,,,,,,,11/8/2001 0:00,,Penitentiary,TDCJ,,Abilene,,,,,,Natural Causes/Illness,Cardiomyopathy Heart Failure,,,,Pre-existing medical condition,No,Yes,Anglo or White,,,,Convicted,child,,V_BJS


In [17]:
cdr.shape, old_master.shape

((6265, 58), (7735, 47))

#### There will be a few columns that the old master file doesn't have, which is fine

In [18]:
set(cdr.columns) - set(old_master.columns)

{'census_tract',
 'department_type',
 'exhibit_any_medical_problems',
 'exhibit_any_mental_health_problems',
 'latitude',
 'longitude',
 'make_suicidal_statements',
 'race',
 'specify_other_behavior',
 'version_number',
 'version_type'}

#### However, it shouldn't have any novel columns that the *other* data doesn't have

In [19]:
assert(len(set(old_master.columns) - set(cdr.columns)) == 0)
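
The two set differences above are what make a plain concatenation safe: columns missing from one frame are simply filled with NaN, and no unexpected columns sneak in. A toy sketch (frame and column contents are made up):

```python
import pandas as pd

new = pd.DataFrame({'first_name': ['A'], 'race': ['X'], 'form_version': ['V_2016']})
old = pd.DataFrame({'first_name': ['B'], 'form_version': ['V_BJS']})

missing_from_old = set(new.columns) - set(old.columns)   # acceptable
extra_in_old = set(old.columns) - set(new.columns)       # should be empty

# Columns absent from `old` come through as NaN for its rows.
combined = pd.concat([new, old], ignore_index=True, sort=False)
```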

In [20]:
cdr = pd.concat([cdr, old_master])
cdr.reset_index(inplace=True, drop=True)
cdr.sort_values('form_version', inplace=True)
cdr.head()

Unnamed: 0,age_at_time_of_death,agency_address,agency_city,agency_county,agency_name,agency_zip,cdr_cdr_name,census_tract,city,county,custody_date_na,date_of_birth,date_time_of_custody_or_incident,death_causer_other,death_date_and_time,death_location,death_location_elsewhere,department_type,entry_behavior,entry_date_time,entry_date_time_n_a,ethnicity,ethnicity_other,exhibit_any_medical_problems,exhibit_any_mental_health_problems,first_name,form_version,last_name,latitude,longitude,make_suicidal_statements,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,middle_name,offense_1,offense_2,offense_3,other_behavior,pre_existing_medical_condition,race,report_date,sex,specific_type_of_custody_facility,specify_other_behavior,street_address,suffix,type_of_custody,type_of_offense,type_of_offense_other,version_number,version_type,were_the_charges,who_caused_the_death
0,40.0,350 W. Belknap,Fort Worth,TARRANT,Fort Worth Police Dept.,76102,PA05001C,,Fort Worth,Tarrant,0.0,1964-04-03,2005-01-01 15:04:00,,2005-01-01 15:04:00,At the crime/arrest scene,,POLICE,,NaT,1.0,Hispanic,,,,Dino,V_2005,Gomez,,,,Justifiable Homicide,,Firearm,,Gunshot Wound to the Chest,"Yes, results are available",Not Applicable,,Aggravated Assault,,,0,Not Applicable; cause of death was accidental ...,,2005-03-02 14:48:00,Male,Custody of Peace Officer during/fleeing arrest,,1509 W. Hammond,,Police Custody (pre-booking),,,1.0,ORIGINAL VERSION,Not filed at time of death,Law enforcement/correctional staff
3437,33.0,133 N. Riverfront Blvd.,Dallas,Dallas,Dallas County Sheriff's Dept.,75207,PA14199CJ,,Dallas,Dallas,0.0,1981-03-23,2014-04-17 09:49:00,,2014-04-20 08:14:00,At medical facility,,SHERIFF,,2014-04-17 09:49:00,0.0,Anglo,,,,Alisha,V_2005,Skeats,,,,Natural Causes/Illness,Complications of chronic opiate use,Not applicable; cause of death was intoxicatio...,,"Complications of chronic opiate, benzodiazepin...","Yes, results are available",Yes,,Theft of Service,,,0,Don't know,,2014-04-28 09:27:00,Female,Hospital/Infirmary,,Parkland Hospital,,County Jail,,property crime,1.0,ORIGINAL VERSION,Filed,"Not applicable; cause of death was suicide, in..."
3436,64.0,"2503 Lake Road, Suite 5",Huntsville,Walker,Texas Department Of Criminal Justice,77340,PA14198P,,Tennessee Colony,Anderson,0.0,1949-11-02,1998-10-06 00:00:00,,2014-04-23 09:07:00,At law enforcement facility,,STAGENCY,,1998-10-06 00:00:00,0.0,Anglo,,,,Orville,V_2005,McCoy,,,,Natural Causes/Illness,Malignant Pleural Effusion GT Bleesing/Anemia,Not applicable; cause of death was intoxicatio...,,ESLD-Liver Disease,"No, evaluation not planned",Yes,Lee,Aggravated Sexual Assault of a Child,,,0,Don't know,,2014-04-25 15:15:00,Male,TDCJ,,"Michael Unit, 2664 FM 2054",,Penitentiary,,,1.0,ORIGINAL VERSION,Convicted,"Not applicable; cause of death was suicide, in..."
3435,53.0,"2503 Lake Road, Suite 5",Huntsville,Walker,Texas Department Of Criminal Justice,77340,PA14197P,,New Boston,Bowie,0.0,1960-05-11,2012-03-19 00:00:00,,2014-04-16 06:44:00,At law enforcement facility,,STAGENCY,,2012-04-25 00:00:00,0.0,Hispanic,,,,Pedro,V_2005,Juarez,,,,Natural Causes/Illness,Cardiac arrhythmia,Not applicable; cause of death was intoxicatio...,,Sudden unexpected death in schizophrenia due t...,"Yes, results are available",Not Applicable,,Aggravated Assault,,,0,Don't know,,2014-04-25 13:25:00,Male,TDCJ,,3899 State Hwy 98,Jr,Penitentiary,,,1.0,ORIGINAL VERSION,Convicted,"Not applicable; cause of death was suicide, in..."
3434,44.0,"2503 Lake Road, Suite 5",Huntsville,Walker,Texas Department Of Criminal Justice,77340,PA14196P,,Richmon,Fort Bend,0.0,1969-10-03,2014-02-03 00:00:00,,2014-04-23 13:35:00,At medical facility,,STAGENCY,,2014-02-03 00:00:00,0.0,African-American,,,,James,V_2005,Benjamin,,,,Natural Causes/Illness,Thromboembolism,Not applicable; cause of death was intoxicatio...,,Acute pulmonary thromboembolism,"Yes, results are available",Not Applicable,Neal,Felony Theft ( 1yr probation) SAFPF,,,0,Don't know,,2014-04-25 11:14:00,Male,TDCJ,,Jester 4 Unit-4 Jester Rd.,,Penitentiary,,Theft,1.0,ORIGINAL VERSION,Probation/Parole,"Not applicable; cause of death was suicide, in..."


### Upcase string cell contents

In [21]:
upcase_strip_string_cells(cdr)
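
`upcase_strip_string_cells` comes from `lib.cleaning_tools` and its source is not shown here. As a rough idea of what such a helper might do, the following is a hypothetical stand-in (the function name and behavior are our assumptions, not the library's actual code): it uppercases and strips string cells in place and leaves everything else alone.

```python
import pandas as pd

def upcase_strip_sketch(df):
    """Hypothetical stand-in: uppercase and strip every string cell, in place."""
    for col in df.columns:
        # Non-string cells (numbers, NaN, timestamps) pass through unchanged.
        df[col] = df[col].map(lambda v: v.strip().upper() if isinstance(v, str) else v)

toy = pd.DataFrame({'name': [' dino ', None], 'age': [40, 51]})
upcase_strip_sketch(toy)
```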

### Add a `death_date` column (without the death time)

In [22]:
cdr['death_date'] = cdr.death_date_and_time.apply(lambda dt: datetime.date(dt.year, dt.month, dt.day))
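
When the column has a proper datetime dtype, the `.dt.date` accessor should give the same result as the `apply` above. A sketch on two toy timestamps, assuming no missing values:

```python
import datetime
import pandas as pd

stamps = pd.Series(pd.to_datetime(['2005-01-01 15:04:00', '2014-04-20 08:14:00']))

# The notebook's approach: build a datetime.date from each timestamp.
via_apply = stamps.apply(lambda dt: datetime.date(dt.year, dt.month, dt.day))

# Vectorized alternative; only available when the Series dtype is datetime64.
via_accessor = stamps.dt.date
```

The `apply` version is the more forgiving of the two if the column ends up with mixed Python objects after concatenation, which may be why it is used here.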

In [23]:
cdr_full = cdr.copy()

## De-duplicate
* We de-dup by finding pairs (or sets) of records that have the same key personal information fields (`pii`).
* In theory we should be able to de-dup by the CDR record id, but sometimes we'll see multiple records for the same person with different record IDs.
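
The two kinds of duplicate can be illustrated on a toy table (all names and ids are made up, and only two personal fields are used to keep it short). `duplicated` marks every repeat after the first occurrence; `keep=False` marks all members of a repeated group.

```python
import pandas as pd

toy = pd.DataFrame({
    'record_id':  ['R1', 'R1', 'R2', 'R3', 'R4', 'R4'],
    'first_name': ['ANN', 'ANN', 'BOB', 'BOB', 'CAL', 'DEE'],
    'last_name':  ['LEE', 'LEE', 'RAY', 'RAY', 'FOX', 'KIM'],
})

# Same person seen before (by personal fields) vs. same record id seen before.
dup_person = toy.duplicated(subset=['first_name', 'last_name'])
dup_id = toy.duplicated(subset='record_id')

# Cross-tabulate the two flags to see where they agree and disagree.
table = pd.crosstab(dup_id.rename('dup_id'), dup_person.rename('dup_person'))

# Drop repeats of the same person, then flag ids still shared by different people.
deduped = toy[~dup_person]
same_id_different_person = deduped[deduped.duplicated(subset='record_id', keep=False)]
```

Here BOB RAY appears under two different ids (caught by the personal fields only), while R4 is shared by two different people (left in place and flagged for review).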

In [24]:
pii_cols = ['first_name', 'last_name', 'date_of_birth', 'death_date']
dup_pii = cdr.duplicated(subset=pii_cols)
dup_pii.name = 'dup_pii'
dup_record_id = cdr.duplicated(subset='cdr_cdr_name')
dup_record_id.name = 'dup_record_id'

In [25]:
pd.crosstab(dup_record_id, dup_pii)

dup_pii,False,True
dup_record_id,Unnamed: 1_level_1,Unnamed: 2_level_1
False,8690,47
True,40,5223


#### As you can see, MOST records with duplicate personal information also use the same record id, but a few do not.

In [26]:
dropped = cdr[dup_pii]
cdr = cdr[~dup_pii]

In [27]:
dup_record_id = cdr.duplicated(subset='cdr_cdr_name', keep=False)  # Note keep=False to mark both in a duplicate pair

In [28]:
# Rows were matched on the personal-information columns only, not on the record id.
print("Removed %d duplicate rows (matched on %s)" % (len(dropped), pii_cols))
print("%d remain" % len(cdr))
print("For now, leaving %d questionable records where there exists another record with the same record id but different personal information" %
     dup_record_id.sum())

Removed 5270 duplicate rows (matched on ['first_name', 'last_name', 'date_of_birth', 'death_date'])
8730 remain
For now, leaving 80 questionable records where there exists another record with the same record id but different personal information


In [29]:
cdr[dup_record_id].sort_values(['cdr_cdr_name'])[['cdr_cdr_name'] + pii_cols].head(6)

Unnamed: 0,cdr_cdr_name,first_name,last_name,date_of_birth,death_date
6086,18-148-P,BRUCE,DANNER,1971-10-10,2018-02-26
6085,18-148-P,PEDRO,ZAMORA,1971-09-23,2018-02-23
41,PA05042P,ARTHUR,ETTER,2005-05-05,2005-03-01
6377,PA05042P,ARTHUR,ETTER,2023-03-02,2005-03-01
6552,PA05142CJ,FRED,GREEN,1952-07-14,2005-07-14
141,PA05142CJ,FRED,GREEN,2005-06-29,2005-07-14


## Standardize and merge race columns -- the 2005 form calls it 'ethnicity', the 2016 'race'

#### Have a look at the values first

In [30]:
cdr.race.value_counts()

ANGLO OR WHITE               449
HISPANIC OR LATINO           310
BLACK OR AFRICAN AMERICAN    269
OTHER                          5
ASIAN OR PACIFIC ISLANDER      2
Name: race, dtype: int64

In [31]:
cdr.ethnicity.value_counts()

ANGLO                               2093
AFRICAN-AMERICAN                    1513
HISPANIC                            1424
ANGLO OR WHITE                      1095
BLACK OR AFRICAN AMERICAN            754
HISPANIC OR LATINO                   691
OTHER/DON'T KNOW                      39
ASIAN                                 34
OTHER                                 22
AMERICAN INDIAN/ALASKA NATIVE         13
MIDDLE EAST                           12
NATIVE HAWAIIAN/PACIFIC ISLANDER       5
Name: ethnicity, dtype: int64

In [32]:
# Collapse the various OTHER-like values
cdr.loc[cdr.ethnicity.fillna('').str.contains('OTHER'), 'ethnicity'] = 'OTHER'

In [33]:
# When 'Other' is chosen as the ethnicity on the 2005 form, a follow-up
# field asks the reporter to specify. Clearly, some of the specified values
# are not truly 'other' ethnicities. See:
cdr.ethnicity_other.value_counts()

WHITE                  7
UNKNOWN                5
CAUCASIAN              4
0                      3
SUDANESE BLACK         1
BELIZE, BRAZILIAN      1
WHITE NON HISPANIC     1
CUBAN                  1
ARABIAN                1
ANGLO & MIDDLE EAST    1
UNITED KINGDOM         1
Name: ethnicity_other, dtype: int64

In [34]:
# Replace the nonsensical 0 with NaN
cdr.loc[cdr.ethnicity_other.astype(str) == '0', 'ethnicity_other'] = None

In [35]:
# Let's make sure nobody is filling out the "other ethnicity" column when they shouldn't...
cdr[(~(cdr.ethnicity == 'OTHER') & cdr.ethnicity_other.notnull())][['ethnicity', 'ethnicity_other']]

Unnamed: 0,ethnicity,ethnicity_other


In [36]:
# Good. Let's transfer those specified ethnicity_other values into
# the 'ethnicity' column, so we can merge everything at once.
other_eth = (cdr.ethnicity == 'OTHER')
print('Merging %d "ethnicity_other" values into the main "ethnicity" column' % other_eth.sum())
cdr.loc[other_eth, 'ethnicity'] = cdr.ethnicity_other[other_eth]
cdr.drop('ethnicity_other', axis=1, inplace=True)

Merging 61 "ethnicity_other" values into the main "ethnicity" column


In [37]:
# Make a single 'race' column that has merged, simplified values of race or ethnicity.
race_eth_list = []
for race, eth in zip(cdr.race, cdr.ethnicity):
    # Only one of (race, eth) should be set
    assert pd.isnull(race) or pd.isnull(eth)
    if pd.isnull(race):
        if pd.isnull(eth):
            race_eth_list.append(None)
            continue
        x = eth
    else:
        x = race
    race_eth_list.append(x)

In [38]:
cdr['race'] = race_eth_list
cdr.drop('ethnicity', axis=1, inplace=True)
cdr.race.value_counts()

ANGLO                               2093
ANGLO OR WHITE                      1544
AFRICAN-AMERICAN                    1513
HISPANIC                            1424
BLACK OR AFRICAN AMERICAN           1023
HISPANIC OR LATINO                  1001
ASIAN                                 34
AMERICAN INDIAN/ALASKA NATIVE         13
MIDDLE EAST                           12
WHITE                                  7
UNKNOWN                                5
OTHER                                  5
NATIVE HAWAIIAN/PACIFIC ISLANDER       5
CAUCASIAN                              4
ASIAN OR PACIFIC ISLANDER              2
BELIZE, BRAZILIAN                      1
SUDANESE BLACK                         1
WHITE NON HISPANIC                     1
CUBAN                                  1
ARABIAN                                1
ANGLO & MIDDLE EAST                    1
UNITED KINGDOM                         1
Name: race, dtype: int64

### Regularize race and several other columns

In [39]:
race_before = cdr.race.copy()
race_before.name = 'race_before'

In [40]:
standardize_gender_cols(cdr)
standardize_race_cols(cdr)
numericalize_age_cols(cdr)

cdr.race.value_counts()

WHITE       3650
BLACK       2537
HISPANIC    2425
OTHER         80
Name: race, dtype: int64

In [41]:
cdr.groupby([race_before, 'race']).size()

race_before                       race    
AFRICAN-AMERICAN                  BLACK       1513
AMERICAN INDIAN/ALASKA NATIVE     OTHER         13
ANGLO                             WHITE       2093
ANGLO & MIDDLE EAST               WHITE          1
ANGLO OR WHITE                    WHITE       1544
ARABIAN                           OTHER          1
ASIAN                             OTHER         34
ASIAN OR PACIFIC ISLANDER         OTHER          2
BELIZE, BRAZILIAN                 OTHER          1
BLACK OR AFRICAN AMERICAN         BLACK       1023
CAUCASIAN                         WHITE          4
CUBAN                             OTHER          1
HISPANIC                          HISPANIC    1424
HISPANIC OR LATINO                HISPANIC    1001
MIDDLE EAST                       OTHER         12
NATIVE HAWAIIAN/PACIFIC ISLANDER  OTHER          5
OTHER                             OTHER          5
SUDANESE BLACK                    BLACK          1
UNITED KINGDOM                    OTHER

## Standardize agency names, and add agency counties

In [42]:
# Standardize agency name (so we can join/compare across datasets)
cdr['agency_name'] = cdr['agency_name'].apply(standardize_agency_name)

# Lookup county name by agency name. If this fails, fall back
# on the county specified in the form, if it exists.
cdr['agency_county'] = cdr['agency_county'].str.upper()
county_lookup = cdr['agency_name'].apply(lambda name: agency_county.get(name, np.nan))
cdr['agency_county'] = county_lookup.fillna(cdr['agency_county'])

# Manually handle one major agency
cdr.loc[cdr['agency_name'] == 'TEXAS DEPT OF CRIMINAL JUSTICE', 'agency_county'] = 'STATE'

In [43]:
# Check that we are only missing counties for a paltry few records now.
cdr[cdr['agency_county'].isnull()]['agency_name'].value_counts()

TEXAS JUVENILE JUSTICE DEPT OFFICE OF INSPECTOR GENERAL     2
HILDAGO CO SHERIFFS OFFICE                                  2
GULF COAST VIOLENT OFFENDERS TASK FORCE                     1
NORTHSIDE INDEPENDENT SCHOOL DIST POLICE                    1
SUGARLAND POLICE DEPT                                       1
METROPOLITAN TRANSIT AUTH OF HARRIS CO                      1
ECTOR CO SHERIFFS OFC                                       1
BEAUMONT INDEPENDENT SCHOOL DIST POLICE DEPT                1
ELLIS CO SHERIFFS OFFICE WAYNE MCCOLLUM DETENTION CENTER    1
NACONA POLICE DEPT                                          1
Name: agency_name, dtype: int64

## Fix death-information related columns

#### The two form versions offer slightly different options (e.g. V_2005 uses 'AT MEDICAL FACILITY' while V_2016 uses 'MEDICAL FACILITY'). Collapse values to stabilize the options across form versions, and roll up rare values into a single 'OTHER' value.
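
Each fix in this section repeats one pattern: skip missing values, strip whitespace, look the value up in a `replacements` dict, and leave unmapped values alone. A minimal sketch of that pattern as a reusable function follows. The name `collapse_value` is hypothetical and is not part of `lib.cleaning_tools`. The toy mapping spells out 'OTHER, SPECIFY' literally, on the assumption that this is the value of `OTHER_SPECIFY` as it appears in the cell outputs.

```python
import math


def collapse_value(value, replacements):
    """Map one raw form value onto its canonical spelling.

    Hypothetical helper. Missing values (None or float NaN) come back as
    None; a value not listed in `replacements` is returned stripped but
    otherwise unchanged.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    cleaned = value.strip()
    return replacements.get(cleaned, cleaned)


# Toy mapping using two of the death_location spellings from the forms.
example_replacements = {
    'AT MEDICAL FACILITY': 'MEDICAL FACILITY',
    'ELSEWHERE, SPECIFY': 'OTHER, SPECIFY',
}
example_values = ['AT MEDICAL FACILITY ', 'MEDICAL FACILITY', float('nan'), None]
collapsed = [collapse_value(v, example_replacements) for v in example_values]
```

A function like this could be passed to `Series.apply` in place of the inline lambdas. One difference from those lambdas: they return the original, unstripped value when no mapping is found, whereas this sketch returns the stripped one.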

### Fix: `death_location`

In [44]:
pd.crosstab(cdr.death_location, cdr.form_version)

form_version,V_2005,V_2016,V_BJS
death_location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AT LAW ENFORCEMENT FACILITY,1265,0,13
AT MEDICAL FACILITY,3031,0,26
AT THE CRIME/ARREST SCENE,661,0,7
DEAD ON ARRIVAL AT MEDICAL FACILITY,0,20,0
ELSEWHERE,97,0,2
"ELSEWHERE, SPECIFY",1,19,0
EN ROUTE TO BOOKING CENTER/POLICE LOCKUP,2,2,0
EN ROUTE TO MEDICAL FACILITY,53,0,1
LAW ENFORCEMENT FACILITY/BOOKING CENTER,0,169,0
MEDICAL FACILITY,0,696,0


In [45]:
replacements = {
    'AT MEDICAL FACILITY': 'MEDICAL FACILITY',
    'AT LAW ENFORCEMENT FACILITY': 'LAW ENFORCEMENT FACILITY',
    'AT THE CRIME/ARREST SCENE': 'CRIME/ARREST SCENE',
    'SCENE OF INCIDENT': 'CRIME/ARREST SCENE',
    'LAW ENFORCEMENT FACILITY/BOOKING CENTER': 'LAW ENFORCEMENT FACILITY',
    'DEAD ON ARRIVAL AT MEDICAL FACILITY': 'EN ROUTE TO MEDICAL FACILITY',
    'EN ROUTE TO BOOKING CENTER/POLICE LOCKUP': 'EN ROUTE TO LAW ENFORCEMENT FACILITY',
    'ELSEWHERE': OTHER_SPECIFY,
    'ELSEWHERE, SPECIFY': OTHER_SPECIFY,
}
cdr['death_location'] = cdr['death_location'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr.death_location, cdr.form_version)

form_version,V_2005,V_2016,V_BJS
death_location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CRIME/ARREST SCENE,662,129,7
EN ROUTE TO LAW ENFORCEMENT FACILITY,2,2,0
EN ROUTE TO MEDICAL FACILITY,53,20,1
LAW ENFORCEMENT FACILITY,1265,169,13
MEDICAL FACILITY,3031,696,26
"OTHER, SPECIFY",98,19,2


### Fix: `means_of_death`

In [46]:
pd.crosstab(cdr.means_of_death, cdr.form_version)

form_version,V_2005,V_2016,V_BJS
means_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
BATON / BLUNT INSTRUMENT,0,1,0
BLUNT INSTRUMENT,12,0,0
DON'T KNOW,107,0,2
DON\'T KNOW,32,0,0
DRUG OVERDOSE,150,22,0
FIREARM,876,164,5
"HANGING, STRANGULATION",540,76,3
KNIFE / EDGED INSTRUMENT,0,3,0
"KNIFE, CUTTING INSTRUMENT",38,0,0
NOT APPLICABLE,0,0,2


In [47]:
replacements = {
    'NOT APPLICABLE, CAUSE OF DEATH WAS ILLNESS/NATURAL CAUSE': 'NOT APPLICABLE',
    'NOT APPLICABLE; CAUSE OF DEATH WAS INTOXICATION OR ILLNESS/NATURAL CAUSES': 'NOT APPLICABLE',
    'OTHER': OTHER_SPECIFY,
    'KNIFE, CUTTING INSTRUMENT': 'KNIFE / EDGED INSTRUMENT',
    'BLUNT INSTRUMENT': 'BATON / BLUNT INSTRUMENT',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'RIFLE/SHOTGUN': 'FIREARM',
}
cdr['means_of_death'] = cdr['means_of_death'].apply(lambda x:  None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr.means_of_death, cdr.form_version)

form_version,V_2005,V_2016,V_BJS
means_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
BATON / BLUNT INSTRUMENT,12,1,0
DRUG OVERDOSE,150,22,0
FIREARM,876,164,6
"HANGING, STRANGULATION",540,76,3
KNIFE / EDGED INSTRUMENT,38,3,0
NOT APPLICABLE,3122,660,38
"OTHER, SPECIFY",233,26,3
UNKNOWN,139,76,2
VEHICLE ACCIDENT,1,7,0


### It seems officers sometimes misuse the 'other' option, selecting it when another category is more appropriate:

In [48]:
cdr[cdr.means_of_death == OTHER_SPECIFY]['means_of_death_other'].value_counts().head()

MOTOR VEHICLE ACCIDENT                                         8
BLUNT FORCE TRAUMA RECEIVED DURING A MOTOR VEHICLE ACCIDENT    8
VEHICLE                                                        6
DROWNING                                                       6
FALL                                                           6
Name: means_of_death_other, dtype: int64

### Let's roll all rare and 'unknown'-type values into the 'other' category. 'VEHICLE ACCIDENT' in particular is new in the 2016 form, only indicated as 'other' in older forms.

In [49]:
other_values = ['UNKNOWN', 'VEHICLE ACCIDENT', 'KNIFE / EDGED INSTRUMENT', 'BATON / BLUNT INSTRUMENT']
indices = cdr['means_of_death'].isin(other_values)
cdr.loc[indices, 'means_of_death_other'] = cdr.loc[indices, 'means_of_death']
cdr.loc[indices, 'means_of_death'] = OTHER_SPECIFY

pd.crosstab(cdr.means_of_death, cdr.form_version)

form_version,V_2005,V_2016,V_BJS
means_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
DRUG OVERDOSE,150,22,0
FIREARM,876,164,6
"HANGING, STRANGULATION",540,76,3
NOT APPLICABLE,3122,660,38
"OTHER, SPECIFY",423,113,5


### Fix: `manner_of_death` and  `manner_of_death_description`

In [50]:
pd.crosstab(cdr.form_version, cdr.manner_of_death).T

form_version,V_2005,V_2016,V_BJS
manner_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ACCIDENTAL,3,32,0
ACCIDENTAL INJURY CAUSED BY OTHERS,20,0,4
ACCIDENTAL INJURY TO SELF,155,0,2
ALCOHOL/DRUG INTOXICATION,288,6,0
COULD NOT BE DETERMINED,3,7,0
HOMICIDE (INCLUDES JUSTIFIABLE HOMICIDE),15,126,0
HOMICIDE BY LAW ENFORCEMENT/CORRECTIONAL STAFF,0,0,160
JUSTIFIABLE HOMICIDE,668,0,0
NATURAL,18,602,0
NATURAL CAUSES/ILLNESS,2856,0,2396


In [51]:
pd.crosstab(cdr.manner_of_death, cdr.manner_of_death_description.notnull())

manner_of_death_description,False,True
manner_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1
ACCIDENTAL,32,3
ACCIDENTAL INJURY CAUSED BY OTHERS,24,0
ACCIDENTAL INJURY TO SELF,157,0
ALCOHOL/DRUG INTOXICATION,294,0
COULD NOT BE DETERMINED,9,1
HOMICIDE (INCLUDES JUSTIFIABLE HOMICIDE),140,1
HOMICIDE BY LAW ENFORCEMENT/CORRECTIONAL STAFF,159,1
JUSTIFIABLE HOMICIDE,668,0
NATURAL,603,17
NATURAL CAUSES/ILLNESS,62,5190


In [52]:
replacements = {
    'NATURAL': 'NATURAL CAUSES/ILLNESS',
    'JUSTIFIABLE HOMICIDE': 'HOMICIDE',
    'HOMICIDE BY LAW ENFORCEMENT/CORRECTIONAL STAFF': 'HOMICIDE',
    'OTHER HOMICIDE': 'HOMICIDE',
    'HOMICIDE (INCLUDES JUSTIFIABLE HOMICIDE)': 'HOMICIDE',
    'ACCIDENTAL INJURY CAUSED BY OTHERS': 'ACCIDENTAL',
    'ACCIDENTAL INJURY TO SELF': 'ACCIDENTAL',
    'OTHER': OTHER_SPECIFY,
    'OTHER - SPECIFY': OTHER_SPECIFY,
}
cdr['manner_of_death'] = cdr['manner_of_death'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))

# In past versions, "pending autopsy results" was not an option, and reports had "other"
# checked with some mention of pending autopsy in the free field. We emulate this here
# to preserve consistency across form versions.
other_values = ['PENDING AUTOPSY RESULTS', 'COULD NOT BE DETERMINED']
indices = cdr['manner_of_death'].isin(other_values)
cdr.loc[indices, 'manner_of_death_description'] = cdr.loc[indices, 'manner_of_death']
cdr.loc[indices, 'manner_of_death'] = OTHER_SPECIFY

pd.crosstab(cdr.form_version, cdr.manner_of_death).T

form_version,V_2005,V_2016,V_BJS
manner_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ACCIDENTAL,178,32,6
ALCOHOL/DRUG INTOXICATION,288,6,0
HOMICIDE,799,126,160
NATURAL CAUSES/ILLNESS,2874,602,2396
"OTHER, SPECIFY",141,160,6
SUICIDE,831,109,16


### A few suicides by hanging have the wrong `means_of_death`

In [53]:
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death != 'HANGING, STRANGULATION')]
frame = frame[(frame.medical_cause_of_death.fillna('').str.contains('HANGING')) |
              frame.manner_of_death_description.fillna('').str.contains('HANGING')]
print(len(frame))
frame[['medical_cause_of_death', 'means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description']].head()

44


Unnamed: 0,medical_cause_of_death,means_of_death,means_of_death_other,manner_of_death,manner_of_death_description
3236,HANGING,"OTHER, SPECIFY",UNKNOWN,SUICIDE,
3331,ASPHYXIABY HANGING,NOT APPLICABLE,,SUICIDE,
2745,HANGING,NOT APPLICABLE,,SUICIDE,
992,PROVISIONAL DEATH IS ASPHYXIATION BY HANGING.,NOT APPLICABLE,,SUICIDE,
1058,SUICIDE BY HANGING,NOT APPLICABLE,,SUICIDE,


In [54]:
cdr.loc[frame.index, 'means_of_death'] = 'HANGING, STRANGULATION'

# Be sure we got them all
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death != 'HANGING, STRANGULATION')]
frame = frame[(frame.medical_cause_of_death.fillna('').str.contains('HANGING')) |
              frame.manner_of_death_description.fillna('').str.contains('HANGING')]
print(len(frame))

0


### For a few suicides, `means_of_death` is 'NOT APPLICABLE', which makes no sense. We change these to 'OTHER, SPECIFY', though (as the sample below shows) the medical cause of death often points to a more specific means

In [55]:
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death == 'NOT APPLICABLE')]
print(len(frame))
frame[['means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description', 'medical_cause_of_death']].head()

17


Unnamed: 0,means_of_death,means_of_death_other,manner_of_death,manner_of_death_description,medical_cause_of_death
825,NOT APPLICABLE,,SUICIDE,,SUDDEN CARDIAC DEATH SECONDARY TO EXSANGUINATI...
1022,NOT APPLICABLE,,SUICIDE,,INCISED WOUNDS OF BOTH WRISTS AND RIGHT ARM
1094,NOT APPLICABLE,,SUICIDE,,MORPHINE AND CHLORPHENIRAMINE OVERDOSE
2239,NOT APPLICABLE,,SUICIDE,,GUNSHOT WOUNDS OF RIGHT TEMPLE
1969,NOT APPLICABLE,,SUICIDE,,INCISED WOUNDS OF LEFT ARM


In [56]:
cdr.loc[frame.index, 'means_of_death'] = OTHER_SPECIFY
len(cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death == 'NOT APPLICABLE')])

0

### Fix: `pre_existing_medical_condition` 

In [57]:
pd.crosstab(cdr.form_version, cdr.pre_existing_medical_condition).T

form_version,V_2005,V_2016,V_BJS
pre_existing_medical_condition,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
COULD NOT BE DETERMINED,0,571,0
DECEASED DEVELOPED CONDITION AFTER ADMISSION,0,49,0
DEVELOPED CONDITION AFTER ADMISSION,154,0,167
DON'T KNOW,1142,0,301
DON\'T KNOW,834,0,1
"NOT APPLICABLE; CAUSE OF DEATH WAS ACCIDENTAL INJURY, INTOXICATION, SUICIDE OR HOMICIDE",2044,303,32
PRE-EXISTING MEDICAL CONDITION,937,112,1927


In [58]:
replacements = {
    'DECEASED DEVELOPED CONDITION AFTER ADMISSION': 'DEVELOPED CONDITION AFTER ADMISSION',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'NOT APPLICABLE; CAUSE OF DEATH WAS ACCIDENTAL INJURY, INTOXICATION, SUICIDE OR HOMICIDE': 'NOT APPLICABLE',
    'COULD NOT BE DETERMINED': 'UNKNOWN',
    'PRE-EXISTING MEDICAL CONDITION': 'YES',
}
cdr['pre_existing_medical_condition'] = cdr['pre_existing_medical_condition'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr.form_version, cdr.pre_existing_medical_condition).T

form_version,V_2005,V_2016,V_BJS
pre_existing_medical_condition,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
DEVELOPED CONDITION AFTER ADMISSION,154,49,167
NOT APPLICABLE,2044,303,32
UNKNOWN,1976,571,302
YES,937,112,1927


### Fix: `who_caused_the_death` 

NOTE: This question is framed as follows:
* 2005 form: "If the death was an accident or homicide, who caused the death?"
* 2016 form: "If the death was an accident, homicide **or suicide**, who caused the death?" (emphasis added)

Thus, we need to:
1. Collapse near-identical values from different forms, similar to the other areas here.
1. Remove suicides from the 2016 responses, as they skew the data (see below). While we're at it, change ANY entries whose manner of death is not homicide, accident, or 'other' to have "NOT APPLICABLE" as the value.
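
The second step can be sketched on toy rows as follows. The row dicts and the function name `mask_who_caused` are hypothetical; the real cleanup operates on the `cdr` DataFrame with `.loc`. The set of manners that keep their answer mirrors the cleanup cell later in this section.

```python
# Manners of death for which the "who caused the death" answer is kept.
KEEP_MANNERS = {'HOMICIDE', 'ACCIDENTAL', 'OTHER, SPECIFY'}


def mask_who_caused(row):
    """Return a copy of `row` with who_caused_the_death set to
    'NOT APPLICABLE' unless the manner of death is in KEEP_MANNERS."""
    cleaned = dict(row)
    if cleaned['manner_of_death'] not in KEEP_MANNERS:
        cleaned['who_caused_the_death'] = 'NOT APPLICABLE'
    return cleaned


# Made-up rows for illustration only.
toy_rows = [
    {'manner_of_death': 'SUICIDE', 'who_caused_the_death': 'DECEDENT'},
    {'manner_of_death': 'HOMICIDE', 'who_caused_the_death': 'OTHER DETAINEE(S)'},
    {'manner_of_death': 'NATURAL CAUSES/ILLNESS', 'who_caused_the_death': 'UNKNOWN'},
]
masked = [mask_who_caused(r)['who_caused_the_death'] for r in toy_rows]
```

Only the homicide row keeps its original answer; the suicide and natural-causes rows are both set to 'NOT APPLICABLE', which is what makes the 2016 responses comparable to the 2005 ones.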

#### Collapse near-identical values from different forms, similar to the other areas here.

In [59]:
pd.crosstab(cdr.who_caused_the_death, cdr.death_date_and_time.dt.year)

death_date_and_time,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
who_caused_the_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
DECEASED,11,13,14,13,12,14,5,13,19,12,13,14,0,0
DECEDENT,0,0,0,0,0,0,0,0,0,0,0,8,112,30
DON'T KNOW,3,4,6,15,23,24,21,11,8,5,1,0,0,0
DON\'T KNOW,0,0,0,0,0,0,0,0,0,2,18,14,0,0
LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,0,0,0,0,0,0,0,0,1,2,2,14,78,35
LAW ENFORCEMENT/CORRECTIONAL STAFF,41,45,66,65,74,67,58,97,90,92,114,69,0,0
NOT APPLICABLE,0,0,0,0,0,0,0,0,0,0,0,60,500,168
"NOT APPLICABLE; CAUSE OF DEATH WAS SUICIDE, INTOXICATION OR ILLNESS/NATURAL CAUSES",232,262,290,255,227,217,230,238,566,524,550,442,0,0
OTHER CIVILIAN(S),0,0,0,0,0,0,0,0,0,0,0,0,1,0
OTHER DETAINEE(S),0,0,0,0,0,0,0,0,0,0,0,2,5,1


In [60]:
replacements = {
    'DECEASED': 'DECEDENT',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'LAW ENFORCEMENT/CORRECTIONAL STAFF': 'LAW ENFORCEMENT/CORRECTIONAL PERSONNEL',
    'NOT APPLICABLE; CAUSE OF DEATH WAS SUICIDE, INTOXICATION OR ILLNESS/NATURAL CAUSES': 'NOT APPLICABLE',
    'OTHER DETAINEES': 'OTHER DETAINEE(S)',
    'OTHER PERSONS': 'OTHER CIVILIAN(S)',
    'ACCIDENTAL INJURY TO SELF': 'ACCIDENTAL',
    'UNKNOWN PERSON(S) CAUSED THE INJURY': 'UNKNOWN',
    'UNKNOWN WHETHER DECEDENT SUSTAINED A FATAL INJURY': 'UNKNOWN',
}
cdr['who_caused_the_death'] = cdr['who_caused_the_death'].apply(lambda x:  None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['who_caused_the_death'], cdr.form_version)

form_version,V_2005,V_2016,V_BJS
who_caused_the_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
DECEDENT,153,150,0
LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,719,131,160
NOT APPLICABLE,3993,728,40
OTHER CIVILIAN(S),46,1,0
OTHER DETAINEE(S),48,8,3
UNKNOWN,152,17,3


#### Remove suicides from the 2016 responses to 'who caused the death' question, as in previous years 'who caused the death' was only asked for homicides and accidents. (see below)

In [61]:
pd.crosstab(cdr.who_caused_the_death, cdr.manner_of_death)

manner_of_death,ACCIDENTAL,ALCOHOL/DRUG INTOXICATION,HOMICIDE,NATURAL CAUSES/ILLNESS,"OTHER, SPECIFY",SUICIDE
who_caused_the_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
DECEDENT,112,18,4,3,32,134
LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,2,2,972,0,34,0
NOT APPLICABLE,57,250,22,3456,168,808
OTHER CIVILIAN(S),11,1,20,2,12,1
OTHER DETAINEE(S),8,0,48,0,3,0
UNKNOWN,23,23,19,48,56,3


In [62]:
cdr.loc[~cdr.manner_of_death.isin(['HOMICIDE', 'ACCIDENTAL', OTHER_SPECIFY]), 'who_caused_the_death'] = 'NOT APPLICABLE'
pd.crosstab(cdr.who_caused_the_death, cdr.manner_of_death)

manner_of_death,ACCIDENTAL,ALCOHOL/DRUG INTOXICATION,HOMICIDE,NATURAL CAUSES/ILLNESS,"OTHER, SPECIFY",SUICIDE
who_caused_the_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
DECEDENT,112,0,4,0,32,0
LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,2,0,972,0,34,0
NOT APPLICABLE,57,294,22,5872,168,956
OTHER CIVILIAN(S),11,0,20,0,12,0
OTHER DETAINEE(S),8,0,48,0,3,0
UNKNOWN,23,0,19,0,56,0


### Fix: `were_the_charges` 

In [63]:
pd.crosstab(cdr['were_the_charges'], cdr.form_version)

form_version,V_2005,V_2016,V_BJS
were_the_charges,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A PROBATION/PAROLE VIOLATION,0,12,0
CONVICTED,2840,708,2411
FILED,832,88,5
NOT FILED AT TIME OF DEATH,1331,227,14
PROBATION/PAROLE,108,0,0


In [64]:
replacements = {
    'PROBATION/PAROLE': 'PROBATION/PAROLE VIOLATION',
    'A PROBATION/PAROLE VIOLATION': 'PROBATION/PAROLE VIOLATION',
}
cdr['were_the_charges'] = cdr['were_the_charges'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['were_the_charges'], cdr.form_version)

form_version,V_2005,V_2016,V_BJS
were_the_charges,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CONVICTED,2840,708,2411
FILED,832,88,5
NOT FILED AT TIME OF DEATH,1331,227,14
PROBATION/PAROLE VIOLATION,108,12,0


### Fix `type_of_custody`

In [65]:
pd.crosstab(cdr['type_of_custody'], cdr.form_version)

form_version,V_2005,V_2016,V_BJS
type_of_custody,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
COUNTY JAIL,1034,113,12
MUNICIPAL JAIL,143,14,0
PENITENTIARY,2642,690,2409
POLICE CUSTODY (PRE-BOOKING),1276,108,163
PRE-CUSTODIAL USE OF FORCE,0,95,0
PRIVATE CORRECTIONAL FACILITY,16,0,0
PRIVATE FACILITY,0,15,0


In [66]:
replacements = {
    'PRE-CUSTODIAL USE OF FORCE': 'POLICE CUSTODY (PRE-BOOKING)',
    'PRIVATE CORRECTIONAL FACILITY': 'PRIVATE FACILITY',
    'COUNTY JAIL': 'JAIL',
    'MUNICIPAL JAIL': 'JAIL',
    'PENITENTIARY': 'PRISON',
}
cdr['type_of_custody'] = cdr['type_of_custody'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['type_of_custody'], cdr.form_version)

form_version,V_2005,V_2016,V_BJS
type_of_custody,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
JAIL,1177,127,12
POLICE CUSTODY (PRE-BOOKING),1276,203,163
PRISON,2642,690,2409
PRIVATE FACILITY,16,15,0


### Fix `specific_type_of_custody_facility`

In [67]:
pd.crosstab(cdr.specific_type_of_custody_facility, cdr.form_version)

form_version,V_2005,V_2016,V_BJS
specific_type_of_custody_facility,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CORRECTIONAL/REHABILITATION FACILITY,33,1,1
CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST,6,120,0
CUSTODY OF LAW ENFORCEMENT PERSONNEL SUBSEQUENT TO ARREST,2,38,0
CUSTODY OF PEACE OFFICER DURING/FLEEING ARREST,940,0,7
CUSTODY OF PEACE OFFICER SUBSEQUENT TO ARREST,241,0,3
HALFWAY HOUSE/RESTITUTION CENTER,12,0,0
HOSPITAL/INFIRMARY,305,41,3
JAIL - DAY ROOM/RECREATION AREA,25,4,1
JAIL - DETOX CELL,60,5,0
JAIL - HOLDING CELL,124,15,4


In [68]:
replacements = {
    'TDCJ, SPECIFY': 'TDCJ',
    'CUSTODY OF PEACE OFFICER DURING/FLEEING ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST',
    'CUSTODY OF PEACE OFFICER SUBSEQUENT TO ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST',
    'CUSTODY OF LAW ENFORCEMENT PERSONNEL SUBSEQUENT TO ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST',
    'TEXAS-JUVENILE JUSTICE DEPARTMENT - FACILITY/DETENTION CENTER, SPECIFY': 'OTHER',
    'TJPC': 'OTHER',
    'TYC': 'OTHER',
    'HALFWAY HOUSE/RESTITUTION CENTER': 'OTHER',
    'CORRECTIONAL/REHABILITATION FACILITY': 'OTHER',
    'NON-LAW ENFORCEMENT DETOX FACILITY': 'OTHER',
}
cdr['specific_type_of_custody_facility'] = cdr['specific_type_of_custody_facility'].apply(
    lambda x: x if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['specific_type_of_custody_facility'], cdr.form_version)

form_version,V_2005,V_2016,V_BJS
specific_type_of_custody_facility,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST,243,38,3
CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST,946,120,7
HOSPITAL/INFIRMARY,305,41,3
JAIL - DAY ROOM/RECREATION AREA,25,4,1
JAIL - DETOX CELL,60,5,0
JAIL - HOLDING CELL,124,15,4
JAIL - MULTIPLE OCCUPANCY CELL,380,33,2
JAIL - SINGLE CELL,357,42,0
OTHER,51,3,1
TDCJ,2620,689,2409


### Glance at the types of custody crossed with facility

In [69]:
pd.crosstab(cdr.specific_type_of_custody_facility, cdr.type_of_custody)

type_of_custody,JAIL,POLICE CUSTODY (PRE-BOOKING),PRISON,PRIVATE FACILITY
specific_type_of_custody_facility,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST,16,268,0,0
CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST,3,1070,0,0
HOSPITAL/INFIRMARY,240,93,3,13
JAIL - DAY ROOM/RECREATION AREA,30,0,0,0
JAIL - DETOX CELL,64,1,0,0
JAIL - HOLDING CELL,138,5,0,0
JAIL - MULTIPLE OCCUPANCY CELL,412,1,0,2
JAIL - SINGLE CELL,393,0,3,3
OTHER,18,3,22,12
TDCJ,2,2,5713,1


#### Identify and drop a range of unnecessary columns

In [70]:
cdr.groupby([cdr.entry_date_time.isnull(), cdr.entry_date_time_n_a]).size().unstack()

entry_date_time_n_a,0.0,1.0
entry_date_time,Unnamed: 1_level_1,Unnamed: 2_level_1
False,4400.0,16.0
True,,1779.0


In [71]:
pd.crosstab(cdr.custody_date_na, cdr['date_time_of_custody_or_incident'].isnull())

date_time_of_custody_or_incident,False
custody_date_na,Unnamed: 1_level_1
0.0,5113
1.0,1


In [72]:
cdr.drop(['entry_date_time_n_a', 'custody_date_na'], axis=1, inplace=True)

In [73]:
pd.crosstab(cdr.type_of_offense.notnull(), cdr.form_version)

form_version,V_2005,V_2016,V_BJS
type_of_offense,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,5101,1,162
True,10,1034,2422


In [74]:
pd.crosstab(cdr.type_of_offense_other.notnull(), cdr.form_version)

form_version,V_2005,V_2016,V_BJS
type_of_offense_other,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,4350,910,2575
True,761,125,9


In [75]:
cdr.drop(['type_of_offense', 'type_of_offense_other'], axis=1, inplace=True)

In [76]:
pd.crosstab(cdr.department_type.notnull(), cdr.form_version)

form_version,V_2005,V_2016,V_BJS
department_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,1,1035,2584
True,5110,0,0


In [77]:
cdr.drop('department_type', axis=1, inplace=True)

In [78]:
cdr['other_behavior'].value_counts()

0.0    4936
1.0     175
Name: other_behavior, dtype: int64

In [79]:
pd.crosstab(cdr['other_behavior'], cdr['specify_other_behavior'].notnull())

specify_other_behavior,False,True
other_behavior,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,4936,0
1.0,0,175


In [80]:
cdr['other_behavior'] = cdr['specify_other_behavior']
cdr.drop('specify_other_behavior', axis=1, inplace=True)

## Add new columns

### Column for the time (in days) between incarceration (or incident) and death
Note: for a few of these, the death date/time is before the incarceration/incident date. If the gap is at most one day, we call it 0 and assume a minor error. If the gap is more than a day, there was clearly a larger error, so we use a NaN value.

In [81]:
def get_days(dt):
    if dt.days < -1:
        return None
    elif dt.days == -1:
        return 0
    else:
        return dt.days

delta = cdr.death_date_and_time - cdr.date_time_of_custody_or_incident
print("For %d records with death date before custody date, setting the days_from_custody_to_death to NaN" % (delta.dt.days < -1).sum())
cdr['days_from_custody_to_death'] = delta.apply(get_days)

For 5 records with death date before custody date, setting the days_from_custody_to_death to NaN
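
The `-1` branch in `get_days` relies on how negative durations are normalized: the `days` attribute rounds toward negative infinity, so a death recorded up to 24 hours before custody reports `days == -1`, and anything earlier reports `-2` or less. The sketch below restates the rule with the standard library `datetime` types (pandas `Timedelta` objects should follow the same convention). The timestamps are made up.

```python
from datetime import datetime, timedelta


def get_days(delta):
    """Same rule as the cell above, restated for plain `timedelta` objects."""
    if delta.days < -1:
        return None   # death more than a day before custody: treat as bad data
    if delta.days == -1:
        return 0      # death up to 24 hours "before" custody: treat as same day
    return delta.days


custody = datetime(2017, 7, 17, 22, 41)  # made-up timestamp
examples = {
    'death 4 hours after custody': custody + timedelta(hours=4),
    'death 3 hours before custody': custody - timedelta(hours=3),
    'death 30 hours before custody': custody - timedelta(hours=30),
}
results = {label: get_days(death - custody) for label, death in examples.items()}
```

The first two cases both come out as 0 days, while the 30-hour case yields `None`, which becomes NaN once stored in the column.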


In [82]:
cdr['name_full'] = ''
for col in ['first_name', 'middle_name', 'last_name', 'suffix']:
    cdr['name_full'] = cdr['name_full'] + ' ' + cdr[col].fillna('')
cdr['name_full'] = cdr['name_full'].apply(lambda s: ' '.join(s.strip().split()))
cdr.loc[cdr['name_full'] == '', 'name_full'] = np.nan
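
The name-joining logic above can also be read as a small function applied to one record at a time. This is a sketch only: `build_full_name` is a hypothetical name, and it returns `None` for an empty result where the cell above uses `np.nan`.

```python
def build_full_name(*parts):
    """Join the non-empty string parts with single spaces.

    Non-string parts (None, NaN) are skipped; returns None when nothing is left.
    """
    words = ' '.join(p for p in parts if isinstance(p, str)).split()
    return ' '.join(words) if words else None


# Made-up inputs: a missing middle name and stray whitespace.
full = build_full_name('JANE', None, ' DOE ', 'JR.')
empty = build_full_name(None, '', float('nan'))
```

Splitting on whitespace and rejoining is what collapses the double spaces left behind by missing middle names or suffixes.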

## Tweak bookkeeping columns

In [83]:
cdr['num_revisions'] = cdr['version_number'] - 1
cdr.drop(['version_type', 'version_number'], axis=1, inplace=True)

## Rename and reorder columns sensibly

In [84]:
col_renames = {
    'first_name': 'name_first',
    'middle_name': 'name_middle',
    'last_name': 'name_last',
    'suffix': 'name_suffix',
    'cdr_cdr_name': 'record_id',
    'death_causer_other': 'who_caused_death_in_homicide_or_accident_other',
    'who_caused_the_death': 'who_caused_death_in_homicide_or_accident',
    'death_location': 'death_location_type',
    'death_location_elsewhere': 'death_location_type_other',
    'city': 'death_location_city',
    'county': 'death_location_county',
    'latitude': 'death_location_latitude',
    'longitude': 'death_location_longitude',
    'census_tract': 'death_location_census_tract',
    'street_address': 'death_location_street_address',
    'entry_date_time': 'facility_entry_date_time',
    'pre_existing_medical_condition': 'death_from_pre_existing_medical_condition',
}

In [85]:
cdr.rename(col_renames, axis=1, inplace=True)

In [86]:
after = cdr[[
    # Record indexing columns
    'record_id',
    'num_revisions',
    'form_version',
    'report_date',
    'date_time_of_custody_or_incident',

    # Deceased personal information, demographics
    'name_first',
    'name_last',
    'name_middle',
    'name_suffix',
    'name_full',
    'date_of_birth',
    'age_at_time_of_death',
    'sex',
    'race',

    # Death event information
    'death_date',
    'death_date_and_time',
    'death_location_county',
    'death_location_city',
    'death_location_street_address',
    'death_location_latitude',
    'death_location_longitude',
    'death_location_census_tract',
    'death_location_type',
    'death_location_type_other',
    'death_from_pre_existing_medical_condition',
    'manner_of_death',
    'manner_of_death_description',
    'means_of_death',
    'means_of_death_other',
    'medical_cause_of_death',
    'medical_examinor_coroner_evalution',
    'medical_treatment',
    'days_from_custody_to_death',
    'who_caused_death_in_homicide_or_accident',
    'who_caused_death_in_homicide_or_accident_other',

    # Criminal information on deceased
    'offense_1',
    'offense_2',
    'offense_3',
    'were_the_charges',

    # Facility and agency information
    'facility_entry_date_time',
    'type_of_custody',
    'specific_type_of_custody_facility',
    'agency_address',
    'agency_city',
    'agency_county',
    'agency_name',
    'agency_zip',
    
    # Deceased behavior upon entry or custody
    'entry_behavior',
    'other_behavior',
    'exhibit_any_medical_problems',
    'exhibit_any_mental_health_problems',
    'make_suicidal_statements',
]]
assert cdr.shape == after.shape
cdr = after

In [87]:
cdr.sample(10)

Unnamed: 0,record_id,num_revisions,form_version,report_date,date_time_of_custody_or_incident,name_first,name_last,name_middle,name_suffix,name_full,date_of_birth,age_at_time_of_death,sex,race,death_date,death_date_and_time,death_location_county,death_location_city,death_location_street_address,death_location_latitude,death_location_longitude,death_location_census_tract,death_location_type,death_location_type_other,death_from_pre_existing_medical_condition,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,days_from_custody_to_death,who_caused_death_in_homicide_or_accident,who_caused_death_in_homicide_or_accident_other,offense_1,offense_2,offense_3,were_the_charges,facility_entry_date_time,type_of_custody,specific_type_of_custody_facility,agency_address,agency_city,agency_county,agency_name,agency_zip,entry_behavior,other_behavior,exhibit_any_medical_problems,exhibit_any_mental_health_problems,make_suicidal_statements
2759,PA13185P,0.0,V_2005,2013-05-02 10:44:00,1989-11-10 00:00:00,FREDERICK,JOHNSON,RUSSELL,JR.,FREDERICK RUSSELL JOHNSON JR.,1960-11-16,52.0,M,WHITE,2013-04-22,2013-04-22 13:15:00,WALKER,HUNTSVILLE,861-B I-45 N,,,,"OTHER, SPECIFY",TDCJ BOT COMPLEX - WORK ASSIGNMENT,UNKNOWN,NATURAL CAUSES/ILLNESS,BRAIN STEM HEMORRHAGE,NOT APPLICABLE,,BRAIN STEM HEMORRHAGE,"YES, RESULTS ARE AVAILABLE",NOT APPLICABLE,8564.0,NOT APPLICABLE,,BURGLARY OF A HABITATION,,,CONVICTED,1989-11-10 00:00:00,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340.0,,,,,
5607,17-372-CJ,1.0,V_2016,2017-07-18 13:34:00,2017-07-17 22:41:00,MICHAEL,RIVERA,ANTHONY,,MICHAEL ANTHONY RIVERA,1982-07-23,34.0,M,WHITE,2017-07-18,2017-07-18 02:28:00,VICTORIA,VICTORIA,101 N GLASS,,,,MEDICAL FACILITY,,NOT APPLICABLE,ACCIDENTAL,,DRUG OVERDOSE,,TOXIC EFFECTS OF METHAMPHETAMINE,"YES, RESULTS ARE AVAILABLE",UNKNOWN,0.0,NOT APPLICABLE,,POSSESSION OF MARIJUANA UNDER 2 OUNCES,,,NOT FILED AT TIME OF DEATH,2017-07-17 22:21:00,JAIL,JAIL - HOLDING CELL,101 N. GLASS,VICTORIA,VICTORIA,VICTORIA CO SHERIFFS OFFICE,77901.0,,,YES,NO,NO
8497,PRISON20083025,,V_BJS,2008-09-01 00:00:00,1996-06-26 00:00:00,OLLIE,GREEN,P,,OLLIE P GREEN,1951-10-07,57.0,M,BLACK,2008-07-18,2008-07-18 00:00:00,,GALVESTON,,,,,,,YES,NATURAL CAUSES/ILLNESS,RESPIRATORY FAILURE,,,,NO,YES,4405.0,NOT APPLICABLE,,SEXUAL ASSLT/RAPE OF CHILD,0,0.0,CONVICTED,,PRISON,TDCJ,,,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,,,,,,
4012,PA15168C,0.0,V_2005,2015-04-16 08:39:00,2015-03-09 17:01:00,JUAN,TORRES-SALDANA,,,JUAN TORRES-SALDANA,1947-09-15,67.0,M,HISPANIC,2015-03-09,2015-03-09 19:28:00,WEBB,LAREDO,1807 CONVENT AVENUE,,,,CRIME/ARREST SCENE,,NOT APPLICABLE,HOMICIDE,,FIREARM,,TWO GUNSHOT WOUNDS OF THE FACE,"YES, RESULTS ARE AVAILABLE",NOT APPLICABLE,0.0,LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,,MURDER F/1,AGG. ASSAULT AGAINST PEACE OFFICER F/1,,FILED,NaT,POLICE CUSTODY (PRE-BOOKING),CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FL...,4712 MAHER AVE,LAREDO,WEBB,LAREDO POLICE DEPT,78041.0,,SHOT AND KILLED TWO MINORS,,,
10957,PRISON2012352,,V_BJS,2012-09-01 00:00:00,2011-11-17 00:00:00,RICHARD,CARTWRIGHT,,,RICHARD CARTWRIGHT,1944-07-31,68.0,M,WHITE,2012-08-09,2012-08-09 00:00:00,,TENESSEE COLONY,,,,,,,YES,NATURAL CAUSES/ILLNESS,THROAT CANCER,,,,NO,YES,266.0,NOT APPLICABLE,,SEX OFFENSE-AGAINST CHILD-FONDLING,0,0.0,CONVICTED,,PRISON,TDCJ,,,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,,,,,,
606,PA07006CJ,0.0,V_2005,2007-01-18 09:07:00,2007-01-15 19:15:00,RENEE,SPERRY,,,RENEE SPERRY,1961-10-10,45.0,F,WHITE,2007-01-16,2007-01-16 17:00:00,HARRIS,HOUSTON,1200 BAKER,,,,MEDICAL FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,CARDIOVASCULAR DISEASE,NOT APPLICABLE,,HYPERTENSIVE CARDIOVASCULAR DISEASE,"YES, RESULTS ARE AVAILABLE",NO,0.0,NOT APPLICABLE,,THEFT $50-$500,,,CONVICTED,2007-01-15 19:15:00,JAIL,HOSPITAL/INFIRMARY,1200 BAKER STREET,HOUSTON,HARRIS,HARRIS CO SHERIFFS OFFICE,77002.0,,,,,
2001,PA11071CJ,0.0,V_2005,2011-04-13 13:24:00,2011-02-08 11:45:00,JERRY,LAWSON,DION,,JERRY DION LAWSON,1963-08-08,47.0,M,WHITE,2011-04-08,2011-04-08 08:43:00,SAN JACINTO,COLDSPRING,75 W. CEDAR AVE.,,,,LAW ENFORCEMENT FACILITY,,UNKNOWN,ALCOHOL/DRUG INTOXICATION,,"OTHER, SPECIFY",UNKNOWN,COMBINED DURG TOXICITY WITH LETHAL LEVEL OF HY...,"YES, RESULTS ARE AVAILABLE",NOT APPLICABLE,58.0,NOT APPLICABLE,,DWI 3RD OR MORE,,,CONVICTED,2011-02-08 11:45:00,JAIL,JAIL - MULTIPLE OCCUPANCY CELL,75 W. CEDAR AVENUE,COLDSPRING,SAN JACINTO,SAN JACINTO CO SHERIFFS OFFICE,77331.0,,,,,
6082,17-716-CJ,0.0,V_2016,2018-03-13 13:54:00,2017-08-18 11:24:00,OTTO,GUILLORY,DALE,,OTTO DALE GUILLORY,1954-12-20,62.0,M,BLACK,2017-12-10,2017-12-10 11:09:00,HARRIS,HOUSTON,5656 KELLEY STREET,,,,MEDICAL FACILITY,,YES,NATURAL CAUSES/ILLNESS,,NOT APPLICABLE,,COMPLICATIONS OF CIRRHOSIS DUE TO CHRONIC ALCO...,"YES, RESULTS ARE AVAILABLE",YES,113.0,NOT APPLICABLE,,MOTION TO REVOKE PAROLE,,,PROBATION/PAROLE VIOLATION,2017-08-18 11:24:00,JAIL,HOSPITAL/INFIRMARY,1200 BAKER STREET,HOUSTON,HARRIS,HARRIS CO SHERIFFS OFFICE,77002.0,,,YES,YES,NO
2075,PA11145C,0.0,V_2005,2011-07-15 16:24:00,2011-06-21 00:03:00,STANLEY,WHITE,ROYCE,JR.,STANLEY ROYCE WHITE JR.,1983-01-11,28.0,M,BLACK,2011-06-21,2011-06-21 00:17:00,HARRIS,HOUSTON,10600 W. SAM HOUSTON PARKWAY SOUTH,,,,CRIME/ARREST SCENE,,NOT APPLICABLE,ACCIDENTAL,,"OTHER, SPECIFY",BLUN IMPACT TRAUMA,"BLUNT IMPACT TRAUMA OF TORSO, HEAD, AND EXTREM...","YES, RESULTS ARE AVAILABLE",NOT APPLICABLE,0.0,NOT APPLICABLE,,ARSON,,,NOT FILED AT TIME OF DEATH,NaT,POLICE CUSTODY (PRE-BOOKING),CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FL...,1200 TRAVIS,HOUSTON,HARRIS,HOUSTON POLICE DEPT,77002.0,,,,,
3591,PA14355P,0.0,V_2005,2014-08-05 14:02:00,2010-07-08 08:00:00,DAVID,HERRERA,,,DAVID HERRERA,1961-08-14,52.0,M,HISPANIC,2014-07-30,2014-07-30 17:47:00,GALVESTON,GALVESTON,809 HARBORSIDE DRIVE,,,,MEDICAL FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,CARDIOPULMONARY ARREST,NOT APPLICABLE,,CARDIOPULMONARY ARREST,"NO, EVALUATION NOT PLANNED",NOT APPLICABLE,1483.0,NOT APPLICABLE,,ENGAGING IN ORGANIZED CRIMINAL ACTIVITY,,,CONVICTED,2010-07-08 08:00:00,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340.0,,,,,


In [88]:
show_notnull_chart(cdr)

Unnamed: 0,age_at_time_of_death,agency_address,agency_city,agency_county,agency_name,agency_zip,date_of_birth,date_time_of_custody_or_incident,days_from_custody_to_death,death_date,death_date_and_time,death_from_pre_existing_medical_condition,death_location_census_tract,death_location_city,death_location_county,death_location_latitude,death_location_longitude,death_location_street_address,death_location_type,death_location_type_other,entry_behavior,exhibit_any_medical_problems,exhibit_any_mental_health_problems,facility_entry_date_time,form_version,make_suicidal_statements,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,name_first,name_full,name_last,name_middle,name_suffix,num_revisions,offense_1,offense_2,offense_3,other_behavior,race,record_id,report_date,sex,specific_type_of_custody_facility,type_of_custody,were_the_charges,who_caused_death_in_homicide_or_accident,who_caused_death_in_homicide_or_accident_other
V_2005,1.0,1.0,1.0,1.0,1,1.0,1.0,1.0,0.999022,1,1,1.0,0,0.998826,1.0,0,0,0.999804,1.0,0.01937,0.141851,0.0,0.0,0.696537,1,0.0,1,0.591861,1.0,0.0829583,0.999413,1.0,0.999804,1,1,1.0,0.558403,0.0704363,1,0.998826,0.26081,0.0815887,0.0342399,1.0,1,1.0,1,1.0,1,1.0,1.0,0.00919585
V_2016,1.0,1.0,1.0,0.995169,1,1.0,1.0,1.0,1.0,1,1,1.0,0,0.999034,1.0,0,0,1.0,1.0,0.021256,0.0,0.872464,0.867633,0.77971,1,0.8657,1,0.157488,1.0,0.113043,0.996135,1.0,1.0,1,1,1.0,0.380676,0.047343,1,0.994203,0.2657,0.100483,0.0,1.0,1,0.999034,1,0.956522,1,1.0,1.0,0.0
V_BJS,0.993808,0.0201238,0.0201238,0.997291,1,0.0201238,0.934985,0.938854,0.938854,1,1,0.939628,0,0.940402,0.0201238,0,0,0.0201238,0.0189628,0.000773994,0.00116099,0.0,0.0,0.0189628,1,0.0,1,0.905573,0.0201238,0.00193498,0.0201238,0.940402,0.940402,1,1,0.997678,0.625774,0.00270898,0,0.940402,0.912926,0.903638,0.0,0.985294,1,0.940402,1,0.940402,1,0.940402,0.998065,0.0


In [89]:
print(len(cdr))

8730


In [90]:
cdr.form_version.value_counts()

V_2005    5111
V_BJS     2584
V_2016    1035
Name: form_version, dtype: int64

## Write

In [91]:
with dw.open_remote_file(DTW_PROJECT_KEY_CDR, CLEANED_FILENAME) as w:
    print("Writing to data.world:", CLEANED_FILENAME)
    cdr.to_csv(w, index=False)

Writing to data.world: cleaned_custodial_death_reports.csv


---
## TODOs -- problems that still exist in the data and need work
---

### [TODO] Some duplicates still exist, but it's unclear which to drop

In [93]:
dup_record_id = cdr.duplicated(subset='record_id', keep=False)  # Note keep=False to mark both in a duplicate pair
print("Not sure what to do with %d inconsistent duplicates" % dup_record_id.sum())
cdr[dup_record_id].sort_values(['record_id']).head(6)

Not sure what to do with 80 inconsistent duplicates


Unnamed: 0,record_id,num_revisions,form_version,report_date,date_time_of_custody_or_incident,name_first,name_last,name_middle,name_suffix,name_full,date_of_birth,age_at_time_of_death,sex,race,death_date,death_date_and_time,death_location_county,death_location_city,death_location_street_address,death_location_latitude,death_location_longitude,death_location_census_tract,death_location_type,death_location_type_other,death_from_pre_existing_medical_condition,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,days_from_custody_to_death,who_caused_death_in_homicide_or_accident,who_caused_death_in_homicide_or_accident_other,offense_1,offense_2,offense_3,were_the_charges,facility_entry_date_time,type_of_custody,specific_type_of_custody_facility,agency_address,agency_city,agency_county,agency_name,agency_zip,entry_behavior,other_behavior,exhibit_any_medical_problems,exhibit_any_mental_health_problems,make_suicidal_statements
6086,18-148-P,0.0,V_2016,2018-03-16 09:47:00,2017-12-07 00:00:00,BRUCE,DANNER,,,BRUCE DANNER,1971-10-10,46.0,M,WHITE,2018-02-26,2018-02-26 16:15:00,JEFFERSON,BEAUMONT,3295 FM 3514,,,,MEDICAL FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,,NOT APPLICABLE,,ANOXIC ENCEPHALOPATHY,"NO, EVALUATION NOT PLANNED",UNKNOWN,81.0,NOT APPLICABLE,,POSSESSION OF A CONTROLLED SUBSTANCE,,,CONVICTED,2017-12-07 00:00:00,PRISON,TDCJ,1012 VETERANS MEMORIAL PARKWAY,HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77320.0,,,YES,UNKNOWN,UNKNOWN
6085,18-148-P,0.0,V_2016,2018-03-16 09:40:00,2011-02-07 00:00:00,PEDRO,ZAMORA,,,PEDRO ZAMORA,1971-09-23,46.0,M,HISPANIC,2018-02-23,2018-02-23 19:02:00,LUBBOCK,LUBBOCK,8602 PEACH AVE,,,,MEDICAL FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,,NOT APPLICABLE,,PANCREATITIS /PNEUMONIA,"NO, EVALUATION NOT PLANNED",YES,2573.0,NOT APPLICABLE,,INDECENCY WITH A CHILD,,,CONVICTED,2011-02-07 00:00:00,PRISON,TDCJ,1012 VETERANS MEMORIAL PARKWAY,HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77320.0,,,YES,UNKNOWN,UNKNOWN
41,PA05042P,0.0,V_2005,2005-04-18 13:37:00,2004-09-24 10:00:00,ARTHUR,ETTER,,,ARTHUR ETTER,2005-05-05,-1.0,M,WHITE,2005-03-01,2005-03-01 20:24:00,GALVESTON,GALVESTON,BETO UNIT,,,,MEDICAL FACILITY,,DEVELOPED CONDITION AFTER ADMISSION,NATURAL CAUSES/ILLNESS,PNEUMONIA,NOT APPLICABLE,,RESPIRATORY FAILURE,"NO, EVALUATION NOT PLANNED",YES,158.0,NOT APPLICABLE,,CRIMINAL SOLICITATION OF A MINOR,,,CONVICTED,2004-09-24 10:00:00,PRISON,TDCJ,P.O. BOX 4003,HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,773424003.0,,,,,
6377,PA05042P,,V_BJS,2005-04-18 13:37:00,2004-09-24 10:00:00,ARTHUR,ETTER,,,ARTHUR ETTER,2023-03-02,82.0,M,WHITE,2005-03-01,2005-03-01 00:00:00,GALVESTON,GALVESTON,BETO UNIT,,,,MEDICAL FACILITY,,DEVELOPED CONDITION AFTER ADMISSION,NATURAL CAUSES/ILLNESS,PNEUMONIA,NOT APPLICABLE,,RESPIRATORY FAILURE,"NO, EVALUATION NOT PLANNED",YES,157.0,NOT APPLICABLE,,CRIMINAL SOLICITATION OF A MINOR,,,CONVICTED,9/24/2004,PRISON,TDCJ,P.O. BOX 4003,HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,773424000.0,,,,,
6552,PA05142CJ,,V_BJS,2005-08-11 14:08:00,2005-05-01 13:56:00,FRED,GREEN,,,FRED GREEN,1952-07-14,53.0,M,BLACK,2005-07-14,2005-07-14 00:00:00,DALLAS,DALLAS,133 N. INDUSTRIAL BLVD.,,,,MEDICAL FACILITY,,DEVELOPED CONDITION AFTER ADMISSION,NATURAL CAUSES/ILLNESS,BRAIN ANEURISM,NOT APPLICABLE,,RUPTURED BERRY ANEURYSM,"YES, RESULTS ARE AVAILABLE",NO,73.0,NOT APPLICABLE,,POSSESSION,PUBLIC INTOXICATION,,FILED,5/1/2005,JAIL,HOSPITAL/INFIRMARY,133 N. INDUSTRIAL BLVD.,DALLAS,DALLAS,DALLAS CO SHERIFFS OFFICE,75207.0,,,,,
141,PA05142CJ,0.0,V_2005,2005-08-11 14:08:00,2005-05-01 13:56:00,FRED,GREEN,,,FRED GREEN,2005-06-29,0.0,M,BLACK,2005-07-14,2005-07-14 23:25:00,DALLAS,DALLAS,133 N. INDUSTRIAL BLVD.,,,,MEDICAL FACILITY,,DEVELOPED CONDITION AFTER ADMISSION,NATURAL CAUSES/ILLNESS,BRAIN ANEURISM,NOT APPLICABLE,,RUPTURED BERRY ANEURYSM,"YES, RESULTS ARE AVAILABLE",NO,74.0,NOT APPLICABLE,,POSSESSION,PUBLIC INTOXICATION,,FILED,2005-05-01 13:56:00,JAIL,HOSPITAL/INFIRMARY,133 N. INDUSTRIAL BLVD.,DALLAS,DALLAS,DALLAS CO SHERIFFS OFFICE,75207.0,,,,,
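One way to make the choice less arbitrary is to list, for each duplicated `record_id`, the columns on which the rows actually disagree. The following is a minimal sketch on a toy frame, not the notebook's own method; `differing_columns` is a hypothetical helper and the toy values are illustrative only.

```python
import pandas as pd

def differing_columns(group):
    """Return the columns whose non-null values disagree within one group of rows."""
    # nunique() ignores NaN by default, so "null vs. value" is not counted as a conflict
    counts = group.nunique()
    return sorted(counts[counts > 1].index)

# Toy stand-in for cdr[dup_record_id] (hypothetical values)
toy = pd.DataFrame({
    'record_id': ['A1', 'A1', 'B2', 'B2'],
    'form_version': ['V_2005', 'V_BJS', 'V_2016', 'V_2016'],
    'name_last': ['ETTER', 'ETTER', 'DANNER', 'ZAMORA'],
    'age_at_time_of_death': [-1.0, 82.0, 46.0, 46.0],
})

conflicts = {rid: differing_columns(g.drop(columns='record_id'))
             for rid, g in toy.groupby('record_id')}
print(conflicts)
```

Pairs that differ only in `form_version` and in date or age fields look like one death reported through two channels, while pairs whose names differ look like two people sharing an ID, in which case neither row should be dropped.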


### [TODO] `date_of_birth` and `age_at_time_of_death` are sometimes impossible (negative ages, birth years in the future)

In [94]:
cdr.age_at_time_of_death.value_counts().sort_index().loc[:10]

-25.0     1
-24.0     1
-21.0     3
-19.0     2
-17.0     1
-13.0     1
-9.0      1
-1.0      3
 0.0     14
 1.0      1
 6.0      1
 8.0      1
Name: age_at_time_of_death, dtype: int64

In [95]:
cdr.date_of_birth.dt.year.value_counts().sort_index().loc[2015:]

2015.0     4
2016.0     1
2018.0     2
2019.0     2
2020.0     2
2021.0     5
2022.0     5
2023.0     5
2024.0    11
2025.0     6
2026.0    19
2027.0    18
2028.0     9
2029.0    18
Name: date_of_birth, dtype: int64
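Birth years after 2018 are consistent with two-digit years having been parsed into the wrong century, although that cause is an assumption rather than something confirmed here. Under that assumption, a minimal sketch of a repair is to shift a birth date back 100 years whenever its year is later than the year of death. `fix_century` is a hypothetical helper, and the toy dates are illustrative.

```python
import pandas as pd

def fix_century(dob, death_date):
    """Shift a birth date back 100 years when its year is later than the death year."""
    if pd.isnull(dob) or pd.isnull(death_date):
        return dob
    if dob.year > death_date.year:
        return dob - pd.DateOffset(years=100)
    return dob

# Toy stand-in for the two cdr columns (assumed to be datetimes)
toy = pd.DataFrame({
    'date_of_birth': pd.to_datetime(['2023-03-02', '1961-08-14', None]),
    'death_date': pd.to_datetime(['2005-03-01', '2014-07-30', '2012-08-09']),
})
toy['date_of_birth_fixed'] = [
    fix_century(b, d) for b, d in zip(toy.date_of_birth, toy.death_date)
]
print(toy)
```

This rule deliberately leaves alone birth dates that fall in the same year as the death or shortly before it. Those look more like data-entry errors than century problems and probably need to be nulled or reviewed by hand, after which `age_at_time_of_death` would need recomputing.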

### [TODO] The list of possible offenses is enormous and needs condensing
#### Of ~8,700 custodial death records, there are ~2,500 unique `offense_1` values, e.g.:

In [96]:
print('Of %d custodial deaths, there are %d unique offense_1 values' % (len(cdr), len(set(cdr.offense_1))))
cdr.offense_1.value_counts().head(20)

Of 8730 custodial deaths, there are 2483 unique offense_1 values


MURDER                                     404
AGGRAVATED SEXUAL ASSAULT OF A CHILD       225
AGGRAVATED ROBBERY                         163
BURGLARY OF HABITATION                     157
DRIVING WHILE INTOXICATED 3RD OR MORE      154
HOMICIDE-MURDER                            153
ROBBERY                                    150
SEXUAL ASSLT/RAPE AGG/SEX ABUSE            138
AGG SEXUAL ASSLT CHILD                     137
DRIVING WHILE INTOXICATED                  133
INDECENCY W/ A CHILD SEXUAL CONTACT        133
AGGRAVATED SEXUAL ASSAULT                  121
SEXUAL ASSLT/RAPE OF CHILD                 119
AGGRAVATED ASSAULT WITH A DEADLY WEAPON    111
CAPITAL MURDER                             110
INDECENCY WITH A CHILD                     110
AGGRAVATED ASSAULT                         103
PUBLIC INTOXICATION                         95
BURGLARY OF A HABITATION                    87
POSSESSION OF A CONTROLLED SUBSTANCE        82
Name: offense_1, dtype: int64

#### Clearly some of these could be condensed. See the possibilities for offenses containing the word 'MURDER':

In [97]:
s = cdr.offense_1.dropna()
murder = s[s.str.contains('MURDER')]
N = 20
s = murder.value_counts()[:N]
s['(...%d other types of murder)' % (len(set(murder)) - N)] = murder.value_counts()[N:].sum()
s

MURDER                                          404
HOMICIDE-MURDER                                 153
CAPITAL MURDER                                  110
ATTEMPTED CAPITAL MURDER                         71
MURDER WITH A DEADLY WEAPON                      62
MURDER INTENTIONALLY CAUSE DEATH                 24
HOMICIDE-CAPITAL MURDER                          19
MURDER WITH DEADLY WEAPON                        18
ATTEMPTED CAPITAL MURDER OF A POLICE OFFICER     17
ATTEMPTED MURDER                                 12
ATTEMPTED CAPITAL MURDER OF A PEACE OFFICER      11
CRIMINAL ATTEMPT CAPITAL MURDER                  11
ATTEMPTED CAPITAL MURDER OF PEACE OFFICER         8
HOMICIDE-ATTEMPT/ASSLT TO COMMIT MURDER           8
MURDER WITH A DEADLY WPN                          8
CAPITAL MURDER IN COMM OF CITED OFFENSES          7
MURDER INTENDS SBI CAUSING DEATH                  6
ATTEMPTED CAPITAL MURDER OF POLICE OFFICER        6
ATT. CAPITAL MURDER                               6
MURDER WITH 

#### Something like this might extract a useful simplified category (murders and attempted murders)

In [98]:
def is_attempt(x):
    parts = x.split()
    parts = [''.join(ch for ch in p if ch.isalpha()) for p in parts]
    for w in ['ATT', 'ATTEMPT', 'ATTEMPTED']:
        if w in parts:
            return True
    return False

att = murder.apply(is_attempt)
att_murder = murder[att]
comm_murder = murder[~att]
set(att_murder)

{'ATT CAPITAL MURDER',
 'ATT MURDER',
 'ATT MURDER WITH A DEADLY WPN',
 'ATT. CAPITAL MURDER',
 'ATT. CAPITAL MURDER OF PEACE OFFICER',
 'ATTEMPT CAP MURDER PO',
 'ATTEMPT CAPITAL MURDER',
 'ATTEMPT CAPITAL MURDER OF A POLICE OFFICER',
 'ATTEMPT CAPITAL MURDER-POLICE OFFICER',
 'ATTEMPT CAPITAL MURDER/ PEACE OFFICER',
 'ATTEMPT CAPITAL MURDER/POLICE OFFICER',
 'ATTEMPT CAPITOL MURDER WITH A DEADLY WEAPON',
 'ATTEMPT TO COMMIT CAPITAL MURDER',
 'ATTEMPTED CAP. MURDER OF A PEACE OFFICER',
 'ATTEMPTED CAPITAL MURDER',
 'ATTEMPTED CAPITAL MURDER (PEACE OFFICER)',
 'ATTEMPTED CAPITAL MURDER - PEACE OFFICER',
 'ATTEMPTED CAPITAL MURDER - PEACE OFFICER/FIREARM',
 'ATTEMPTED CAPITAL MURDER - POLICE OFFICER',
 'ATTEMPTED CAPITAL MURDER 5 COUNTS',
 'ATTEMPTED CAPITAL MURDER DEADLY WEAPON',
 'ATTEMPTED CAPITAL MURDER OF A CHILD',
 'ATTEMPTED CAPITAL MURDER OF A PEACE OFFICER',
 'ATTEMPTED CAPITAL MURDER OF A POLICE OFFICER',
 'ATTEMPTED CAPITAL MURDER OF A POLICE OFFICER (X 3)',
 'ATTEMPTED CAPIT
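The same idea could be generalized into a small ordered rule table that maps raw offense strings to coarse categories. This is only a sketch: the patterns and category labels below are hypothetical and would need review against the full list of offenses.

```python
import re

# Ordered (pattern, category) rules; the first match wins.
# Patterns and labels are illustrative assumptions, not an agreed taxonomy.
OFFENSE_RULES = [
    (re.compile(r'\bATT(?:EMPT(?:ED)?)?\b.*\bMURDER\b'), 'ATTEMPTED MURDER'),
    (re.compile(r'\bMURDER\b'), 'MURDER'),
    (re.compile(r'\b(?:DWI|DRIVING WHILE INTOXICATED)\b'), 'DWI'),
    (re.compile(r'\bBURGLARY\b'), 'BURGLARY'),
]

def categorize_offense(raw):
    """Map a raw offense string to a coarse category, or None for missing values."""
    if not isinstance(raw, str):
        return None
    for pattern, category in OFFENSE_RULES:
        if pattern.search(raw):
            return category
    return 'UNCATEGORIZED'

for example in ['ATT. CAPITAL MURDER', 'HOMICIDE-MURDER', 'DWI 3RD OR MORE', 'ARSON']:
    print(example, '->', categorize_offense(example))
```

Because the rules use word boundaries rather than whitespace splitting, strings such as `HOMICIDE-ATTEMPT/ASSLT TO COMMIT MURDER` are also recognized as attempts. Applying it would look like `cdr.offense_1.apply(categorize_offense)`, and the count of `UNCATEGORIZED` rows shows how much of the list the rules still miss.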

### [TODO] Manner of death is sometimes 'OTHER, SPECIFY' when it should not be
Two columns detail the manner of death: `manner_of_death` and `manner_of_death_description`. Some rows have `manner_of_death` = `OTHER, SPECIFY` when the reporter should have picked a standard value such as `ACCIDENTAL` or `HOMICIDE`, as the free-text `manner_of_death_description` column shows:

In [99]:
frame = cdr[(cdr.manner_of_death == OTHER_SPECIFY) & (cdr.manner_of_death_description.notnull())]
for exempt in ('PENDING', 'DETERMIN'):
    frame = frame[~frame.manner_of_death_description.str.contains(exempt)]
print(len(frame), len(cdr))
frame[['form_version', 'manner_of_death_description', 'manner_of_death', 'medical_cause_of_death']]

81 8730


Unnamed: 0,form_version,manner_of_death_description,manner_of_death,medical_cause_of_death
3377,V_2005,ACCIDENT,"OTHER, SPECIFY",1.) ISCHEMIC HEART DISEASE WITH ACUTE HIP FRAC...
3372,V_2005,ACCIDENT,"OTHER, SPECIFY",INTRACEREBRAL HEMORRHAGE - METHAMPHETAMINE ABU...
3747,V_2005,ALCOHOL WITHDRAWAL,"OTHER, SPECIFY",COMPLICATIONS OF ETHANOL WITHDRAWAL DUE TO CHR...
3856,V_2005,MOTOR VEHICLE ACCIDENT,"OTHER, SPECIFY",BLUNT FORCE TRAUMA SUFFERED DURING A MOTOR VEH...
3574,V_2005,SELF INFLICTED INJUY,"OTHER, SPECIFY",PENDING AUTOSPY RESULTS
3682,V_2005,INVOLVED IN VEHICLE CRASH CAUSING DEATH,"OTHER, SPECIFY",MULTIPLE BLUNT FORCE INJURIES
3692,V_2005,CRASHED MOTOR VEHICLE,"OTHER, SPECIFY",BLUNT FORCE INJURIES OF THE HEAD AND NECK
3861,V_2005,MOTOR VEHICLE ACCIDENT,"OTHER, SPECIFY",MULTIPLE BLUNT FORCE INJURIES DUE TO MOTOR VEH...
2871,V_2005,STRUCK BY AUTO WHILE FLEEING ARREST,"OTHER, SPECIFY",MULTIPLE BLUNT FORCE INJURIES
2728,V_2005,COMBINATION - NATURAL AND INJURIES,"OTHER, SPECIFY",PULMONARY THROMBOEMBOLI DUE TO DEEP VENOUS THR...


### [TODO] (Related) Extract motor vehicle accidents into a unique type

In [100]:
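# Starter code: rows not labeled ACCIDENTAL whose free-text fields mention an accident.
# Note that this also matches medical terms such as "CEREBRIAL VASCULAR ACCIDENT".
frame = cdr[cdr.manner_of_death != 'ACCIDENTAL']
frame = frame[frame.medical_cause_of_death.fillna('').str.contains('ACCIDENT') |
              frame.manner_of_death_description.fillna('').str.contains('ACCIDENT')]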
print(len(frame))
frame[['medical_cause_of_death', 'means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description']]

42


Unnamed: 0,medical_cause_of_death,means_of_death,means_of_death_other,manner_of_death,manner_of_death_description
3504,INTRACEREBRAL HEMORRHAGE (CEREBROVASCULAR ACCI...,NOT APPLICABLE,,NATURAL CAUSES/ILLNESS,HEMORRHAGE
3377,1.) ISCHEMIC HEART DISEASE WITH ACUTE HIP FRAC...,NOT APPLICABLE,,"OTHER, SPECIFY",ACCIDENT
3372,INTRACEREBRAL HEMORRHAGE - METHAMPHETAMINE ABU...,NOT APPLICABLE,,"OTHER, SPECIFY",ACCIDENT
3704,ACUTE BRONCHOPNEUMONIA DUE TO LEFT AND RIGHT S...,NOT APPLICABLE,,NATURAL CAUSES/ILLNESS,CEREBRIAL VASCULAR ACCIDENT
3856,BLUNT FORCE TRAUMA SUFFERED DURING A MOTOR VEH...,"OTHER, SPECIFY",BLUNT FORCE TRAUMA RECEIVED DURING A MOTOR VEH...,"OTHER, SPECIFY",MOTOR VEHICLE ACCIDENT
3861,MULTIPLE BLUNT FORCE INJURIES DUE TO MOTOR VEH...,"OTHER, SPECIFY",BLUNT FORCE TRAUMA RECEIVED DURING A MOTOR VEH...,"OTHER, SPECIFY",MOTOR VEHICLE ACCIDENT
2604,"SUBDURAL HEMATOMA, CEREBRAL NECROSIS, CEREBRAL...","OTHER, SPECIFY",UNKNOWN,"OTHER, SPECIFY","UNDETERMINED, BECAUSE WE DO NOT KNOW THE CIRCU..."
3176,"NO AUTOPSY PERFORMED, BODY TOO BADLY BURNED","OTHER, SPECIFY",UNKNOWN,"OTHER, SPECIFY",AUTO ACCIDENT
3175,"BLUNT FORCE INJURIES OF HEAD, NECK AND TORSO","OTHER, SPECIFY",AUTO ACCIDENT,"OTHER, SPECIFY",AUTO ACCIDENT
3031,CHOKING ON FOOD,NOT APPLICABLE,,"OTHER, SPECIFY",ACCIDENT


### [TODO] Fix up `medical_treatment` (answer distributions differ sharply across form versions)

In [101]:
pd.crosstab(cdr.medical_treatment, cdr.form_version).sort_values(['V_2005'],ascending=False)

form_version,V_2005,V_2016,V_BJS
medical_treatment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
NOT APPLICABLE,2871,196,29
YES,1434,220,2334
NO,805,154,7
UNKNOWN,0,465,60


### [TODO] Remove `medical_examinor_coroner_evalution`? Not sure it's useful.

In [102]:
pd.crosstab(cdr.medical_examinor_coroner_evalution, cdr.form_version).sort_values(['V_2005'],ascending=False)

form_version,V_2005,V_2016,V_BJS
medical_examinor_coroner_evalution,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"YES, RESULTS ARE AVAILABLE",4216,329,40
"NO, EVALUATION NOT PLANNED",702,327,9
"YES, RESULTS PENDING",158,379,1
"NO, EVALUATION PENDING",35,0,2
DON'T KNOW,0,0,5
NO,0,0,1679
YES,0,0,694


### [TODO] Somehow blend the `entry_behavior`/`other_behavior` fields from V_2005 with the `exhibit_any_medical_problems`, `exhibit_any_mental_health_problems` and `make_suicidal_statements` fields from V_2016

In [103]:
pd.crosstab(cdr.entry_behavior, cdr.form_version).sort_values(['V_2005'],ascending=False)

form_version,V_2005,V_BJS
entry_behavior,Unnamed: 1_level_1,Unnamed: 2_level_1
MEDICAL,348,1
INTOXICATED,192,1
MENTAL,82,1
MENTAL; MEDICAL,41,0
INTOXICATED; MEDICAL,29,0
INTOXICATED; MENTAL,19,0
INTOXICATED; MENTAL; MEDICAL,9,0
MEDICAL; NA,4,0
INTOXICATED; NA,1,0


In [104]:
pd.crosstab(cdr.exhibit_any_medical_problems, cdr.form_version).sort_values(['V_2016'],ascending=False)

form_version,V_2016
exhibit_any_medical_problems,Unnamed: 1_level_1
UNKNOWN,400
NO,260
YES,243


In [105]:
pd.crosstab(cdr.exhibit_any_mental_health_problems, cdr.form_version).sort_values(['V_2016'],ascending=False)

form_version,V_2016
exhibit_any_mental_health_problems,Unnamed: 1_level_1
UNKNOWN,524
NO,288
YES,86


In [106]:
pd.crosstab(cdr.make_suicidal_statements, cdr.form_version).sort_values(['V_2016'],ascending=False)

form_version,V_2016
make_suicidal_statements,Unnamed: 1_level_1
UNKNOWN,484
NO,345
YES,67
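One possible direction is to split the semicolon-separated `entry_behavior` values into the per-question columns used by the 2016 form. The sketch below makes two assumptions. First, a token such as `MEDICAL` is treated as equivalent to a `YES` on the matching 2016 question. Second, a missing token is left as unknown rather than `NO`, since the 2005 form does not distinguish the two. `appeared_intoxicated` is a hypothetical column name.

```python
def flags_from_entry_behavior(entry_behavior):
    """Translate a V_2005 entry_behavior string into 2016-style YES/None flags."""
    tokens = set()
    if isinstance(entry_behavior, str):
        tokens = {t.strip() for t in entry_behavior.split(';')}
    return {
        'exhibit_any_medical_problems': 'YES' if 'MEDICAL' in tokens else None,
        'exhibit_any_mental_health_problems': 'YES' if 'MENTAL' in tokens else None,
        # Hypothetical column; the 2016 form has a similar "Appear intoxicated" question
        'appeared_intoxicated': 'YES' if 'INTOXICATED' in tokens else None,
    }

print(flags_from_entry_behavior('INTOXICATED; MENTAL; MEDICAL'))
print(flags_from_entry_behavior('MEDICAL; NA'))
```

`make_suicidal_statements` has no counterpart among the `entry_behavior` tokens shown above, so it would stay null for V_2005 rows unless `other_behavior` free text is mined separately.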


### [TODO] Align various binary behavioral columns from each form version, add to final output

In [107]:
# These columns were dropped entirely. Some could be kept once matched with their counterpart in the other form version.
drop_cols

{'Agency Phone',
 'Agency State',
 'Appear intoxicated (alcohol or drugs)',
 "Attempt gain possession officer's weapon",
 'Attempt to Injure Others?',
 'Barricade self or initiate standoff?',
 'Custody Type Facility',
 'Death Reason',
 'Death Weapon Other Description',
 'Decedent Display or Use Weapon Details',
 'Decedent display/use of weapons',
 'Director First Name',
 'Director Last Name',
 'Director Middle Name',
 'Director Salutation',
 'Escape or attempt to escape/flee custody',
 "Gain possession of officer's weapon",
 'Grab, hit or fight with the officer(s)',
 'Injured By',
 'Location Category',
 'Medical Treatment Description',
 'Other Agencies Respond?',
 'Other Location Category',
 'Other device, specify',
 'Other weapon, specify',
 'Physically attempt/assault officer(s)',
 'Reporter Email',
 'Reporter Name',
 'Reporter Name Original CDR',
 'Resist being handcuffed or arrested?',
 'Specify Weapon Used',
 'Status',
 'TDCJ - Specify Unit',
 'Threaten the officer(s) involved',
 