# Clean and reformat CDR data from its multi-tab Excel file into a single CSV

### About the data

CDR data is tricky: the form used by law enforcement has changed over time, first in 2005 and again in 2016. Data from before 2005 is known to be sparse and poorly enforced, so we ignore those entries. The 2005 and 2016 versions of the form overlap in some fields and differ in others, so we must be careful in how we merge them.

**In this repo you can find blank versions of the [2005](https://github.com/texas-justice-initiative/data-processing/blob/master/forms/CDR%20Form%20Version%202005.pdf) and [2016](https://github.com/texas-justice-initiative/data-processing/blob/master/forms/CDR%20Form%20Version%202016.pdf) forms, to see for yourself exactly what fields are collected and how.**

### Datasets used


* Input:
  * `tji/deaths-in-custody/CDR - All Reports.xlsx`
  * `tji/auxiliary-datasets/agencies_and_counties`
* Output:
  * `tji/deaths-in-custody/cleaned_custodial_death_reports.csv`
  
##### Author: Everett Wetchler (everett.wetchler@gmail.com)

## Setup and read data

In [1]:
DTW_PROJECT_KEY_CDR = 'tji/deaths-in-custody'
RAW_FILENAME = 'original/CDR Reports All.xlsx'
SHEETNAMES = ['Form Version 2005', 'Form Version 2016', 'Older Forms']
CLEANED_FILENAME = 'cleaned_custodial_death_reports.csv'

In [2]:
import datadotworld as dw
import json
import numpy as np
import pandas as pd

from lib.cleaning_tools import *

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

%load_ext watermark
%watermark -a "Everett Wetchler" -d -t -z -w -p numpy,pandas,datadotworld

Everett Wetchler 2018-05-25 15:56:22 CDT

numpy 1.14.3
pandas 0.22.0
datadotworld 1.6.0
watermark 1.6.0


In [3]:
from lib.standardize_police_agency_names import standardize_agency_name

In [4]:
datasets = dw.load_dataset('tji/auxiliary-datasets', force_update=True)
agency_county = datasets.dataframes['agencies_and_counties']
agency_county = agency_county.set_index('agency')['county'].to_dict()

In [5]:
dfs = dict(zip(SHEETNAMES, read_dtw_excel(DTW_PROJECT_KEY_CDR, RAW_FILENAME, sheet_names=SHEETNAMES)))
dfs['Form Version 2005']['form_version'] = 'V_2005'
dfs['Form Version 2016']['form_version'] = 'V_2016'
dfs['Older Forms']['form_version'] = 'V_OLDER'
cdr = pd.concat([dfs['Form Version 2005'], dfs['Form Version 2016']])
# Sum over the DataFrames themselves; iterating the dict directly yields the
# sheet-name strings, so len() would count characters rather than records.
print('Read %d raw CDR records. Ignoring %d from older form versions (pre-2005).' % (
    sum(len(x) for x in dfs.values()), len(dfs['Older Forms'])))
print('Keeping %d records using form version 2005, and %d using version 2016 (keeping %d in total)' % (
    len(dfs['Form Version 2005']), len(dfs['Form Version 2016']), len(cdr)))

Writing excel file to temp file: /var/folders/dc/8cbxbsh515s908xl0zyprszm0000gn/T/tmp7eej71j5
Read 11739 raw CDR records. Ignoring 5468 from older form versions (pre-2005).
Keeping 5156 records using form version 2005, and 1115 using version 2016 (keeping 6271 in total)


### In theory, all these records should be for deaths in 2005 or later. Let's double check and drop any miscreants.

In [6]:
assert cdr['Death Date and Time'].isnull().sum() == 0

In [7]:
before = len(cdr)
cdr = cdr[cdr['Death Date and Time'].dt.year >= 2005]
after = len(cdr)
print('Dropped %d (of %d) reports for deaths before 2005, leaving %d' % (before - after, before, after))

Dropped 6 (of 6271) reports for deaths before 2005, leaving 6265


### A quick look at the raw data

In [8]:
dfs['Form Version 2005'].head()

Unnamed: 0,CDR: CDR Name,Report Date,Status,Version Type,Version Number,Department Type,Agency Name,Agency Phone,Agency Address,Agency City,Agency County,Agency State,Agency Zip,Director Salutation,Director First Name,Director Middle Name,Director Last Name,Reporter Name Original CDR,Reporter Email,Street Address,City,County,Longitude,Latitude,Census Tract,Type of Custody,Specific Type of Custody/Facility,Custody Type Facility,Entry Date Time,Entry Date Time N/A,Death Location,Death Location Elsewhere,First Name,Middle Name,Last Name,Suffix,Date of Birth,Sex,Ethnicity,Ethnicity Other,Death Date and Time,Age At Time Of Death,Date/Time of Custody or Incident,Custody Date NA,Medical Examinor/Coroner Evalution?,Manner of Death,Manner of Death Description,Death Reason,Medical Cause of Death,Medical Treatment,Medical Treatment Description,Who caused the death?,Death Causer Other,Type of Death Weapon,Death Weapon Other Description,Pre existing medical condition?,Means of Death,Means of Death Other,Offense 1,Offense 2,Offense 3,Were the Charges:,Type of Offense,"Type of Offense, Other",Injured By,Threaten the officer(s) involved,Resist being handcuffed or arrested?,Try to escape/flee from custody,"Grab, hit or fight with the officer(s)",Other Behavior,Specify Other Behavior,Appear intoxicated (alcohol or drugs),Use weapon threaten/assault officer(s),Entry Behavior,Under Restraint,Type of Restraint,"Other device, specify",form_version
0,PA05001C,2005-03-02 14:48:00,Submitted,ORIGINAL VERSION,1,POLICE,Fort Worth Police Dept.,8178778022,350 W. Belknap,Fort Worth,TARRANT,TX,76102,Chief,Ralph,,Mendoza,Renee Gray,renee.gray@oag.state.tx.us,1509 W. Hammond,Fort Worth,Tarrant,,,,Police Custody (pre-booking),Custody of Peace Officer during/fleeing arrest,,NaT,1,At the crime/arrest scene,,Dino,,Gomez,,1964-04-03,Male,Hispanic,,2005-01-01 15:04:00,40,2005-01-01 15:04:00,0,"Yes, results are available",Justifiable Homicide,,Injuries only,Gunshot Wound to the Chest,Not Applicable,,Law enforcement/correctional staff,,Handgun,,Not Applicable; cause of death was accidental ...,Firearm,,Aggravated Assault,,,Not filed at time of death,,,Injured by Officer,Yes,Yes,Yes,No,0,,No,1,,No,,,V_2005
1,PA05002CJ,2005-03-03 14:07:00,Submitted,ORIGINAL VERSION,1,SHERIFF,Travis County Sheriff's Dept.,5128549770,P. O. Box 1748,Austin,TRAVIS,TX,78767,Sheriff,Margo,L.,Frasier,Renee Gray,renee.gray@oag.state.tx.us,3614 Bill Price Road,Del Valle,Travis,,,,County Jail,Jail - multiple occupancy cell,,2004-10-20 20:56:00,0,At medical facility,,Michael,Darnell,Dickson,,1953-03-04,Male,African-American,,2005-01-01 15:59:00,51,2004-10-20 20:56:00,0,"Yes, results are available",Natural Causes/Illness,Cardiac Arrest,Medical condition only (e.g. heart attack),Myocardial infarction,Yes,"Treatment for hypertension, administered Proca...","Not applicable; cause of death was suicide, in...",,Not Applicable,,Pre-existing medical condition,Not applicable; cause of death was intoxicatio...,,Man/Del/Sell/Poss Controlled Substance,,,Filed,,,Injured by NA,No,No,No,No,0,,No,0,Medical,No,,,V_2005
2,PA05003P,2005-03-03 14:16:00,Submitted,ORIGINAL VERSION,1,STAGENCY,Texas Department Of Criminal Justice,9364376716,P.O. Box 4003,Huntsville,TRAVIS,TX,773424003,Mr.,Chris,H.,Stallings,Renee Gray,renee.gray@oag.state.tx.us,21 FM 247,Huntsville,Walker,,,,Penitentiary,TDCJ,Byrd,2004-12-17 09:00:00,0,At medical facility,,Darryl,Glenn,Wallace,,1960-12-03,Male,African-American,,2005-01-02 04:35:00,44,2004-12-17 09:00:00,0,"Yes, results are available",Natural Causes/Illness,Cardiac,Medical condition only (e.g. heart attack),Cardiac death secondary to cardiac ischemia th...,Yes,Nitroglycerin,"Not applicable; cause of death was suicide, in...",,Not Applicable,,Pre-existing medical condition,Not applicable; cause of death was intoxicatio...,,Burglary of Habitation,,,Convicted,,,Injured by NA,No,No,No,No,0,,No,0,,No,,,V_2005
3,PA05004C,2005-03-03 14:40:00,Submitted,ORIGINAL VERSION,1,POLICE,San Antonio Police Dept.,2102077449,"214 W. Nueva, Suite 331",San Antonio,BEXAR,TX,78207,Chief,Albert,,Ortiz,Renee Gray,renee.gray@oag.state.tx.us,2102 Goliad Road,San Antonio,Bexar,,,,Police Custody (pre-booking),Custody of Peace Officer during/fleeing arrest,,NaT,1,At the crime/arrest scene,,Albert,Chavez,Enriquez,,1983-11-28,Male,Hispanic,,2005-01-03 15:00:00,21,2005-01-03 14:50:00,0,"Yes, results are available",Justifiable Homicide,,Injuries only,Multiple Gunshot Wounds,Not Applicable,,Law enforcement/correctional staff,,Handgun,,Not Applicable; cause of death was accidental ...,Firearm,,Aggravated Robbery,,,Not filed at time of death,,,Injured by Officer,Yes,Yes,Yes,Yes,0,,No,1,,No,,,V_2005
4,PA05005C,2005-03-03 14:51:00,Submitted,ORIGINAL VERSION,1,POLICE,Houston Police Dept.,7133081778,"1200 Travis, 17th Floor",Houston,HARRIS,TX,77002,Chief,Clarence,O.,Bradford,Renee Gray,renee.gray@oag.state.tx.us,3600 block of Telephone Rd,Houston,Harris,,,,Police Custody (pre-booking),Custody of Peace Officer during/fleeing arrest,,NaT,1,At the crime/arrest scene,,Alex,,Mendez,,1978-12-30,Male,Hispanic,,2005-01-03 17:15:00,26,2005-01-03 17:15:00,0,"Yes, results are available",Justifiable Homicide,,Injuries only,Multiple Gunshot Wounds,Not Applicable,,Law enforcement/correctional staff,,Handgun,,Not Applicable; cause of death was accidental ...,Firearm,,Traffic Violation,,,Not filed at time of death,,Traffic Violation,Injured by Officer,Yes,No,No,No,0,,No,1,,No,,,V_2005


In [9]:
dfs['Form Version 2016'].head()

Unnamed: 0,CDR: CDR Name,Version Type,Version Number,Report Date,Status,Agency Name,Agency Address,Agency City,Agency State,Agency Zip,Director Salutation,Director First Name,Director Middle Name,Director Last Name,Reporter Name,Reporter Email,First Name,Middle Name,Last Name,Suffix,Date of Birth,Sex,Race,Age At Time Of Death,Date/Time of Custody or Incident,Death Date and Time,Medical Examinor/Coroner Evalution?,Manner of Death,Manner of Death Description,Medical Cause of Death,Medical Treatment,Who caused the death?,Type of weapon that caused death?,"Other weapon, specify",Pre existing medical condition?,Means of Death,Means of Death Other,Street Address,City,County,Zip,Longitude,Latitude,Census Tract,Location Category,Other Location Category,Type of Custody,Specific Type of Custody/Facility,TDCJ - Specify Unit,Entry Date Time,Entry Date Time N/A,Death Location,Death Location Elsewhere,Other Agencies Respond?,Offense 1,Offense 2,Offense 3,Were the Charges:,Type of Offense,"Type of Offense, Other",Decedent display/use of weapons,Decedent Display or Use Weapon Details,Specify Weapon Used,Attempt to Injure Others?,Appear intoxicated (alcohol or drugs),Make suicidal statements?,Exhibit any mental health problems?,Exhibit any medical problems?,Barricade self or initiate standoff?,Resist being handcuffed or arrested?,Physically attempt/assault officer(s),Gain possession of officer's weapon,Verbally threaten other(s) including law,Escape or attempt to escape/flee custody,Attempt gain possession officer's weapon,Under Restraint,Type of Restraint,"Other device, specify",form_version
0,16-3-C,AMENDED,4,2016-12-12 13:02:00,Submitted,Texas Department Of Public Safety,PO BOX 4087,Austin,TX,78773,Director,Steven,,McCraw,Joanne Scarbrough,joanne.scarbrough@dps.texas.gov,Ivory,Charles,Pantallion,III,1980-09-14,Male,Black or African American,36,2016-11-22 09:26:00,2016-11-22 09:26:00,"Yes, results are available",Homicide (includes Justifiable Homicide),,Multiple Gunshot Wounds,Not Applicable,Law enforcement/correctional personnel,Handgun; Rifle/shotgun,,Not Applicable; cause of death was accidental ...,Firearm,,7300 Interstate 10 W,Baytown,Harris,77521.0,,,,Roadway/highway/street/sidewalk,,Police Custody (pre-booking),Custody of Law Enforcement Personnel during/fl...,,NaT,1,Scene of incident,,Yes,Aggravated Assault on Peace Officer,Evading Arrest or Detention,,Not filed at time of death,Violent Crime Against Persons,,"Yes, mark all that apply",Displayed firearm without discharge,,Yes (select all that apply),No,No,No,No,Unknown,Yes,Yes,No,Unknown,Yes,No,No,,,V_2016
1,14-1-C,AMENDED,2,2016-12-13 16:20:00,Submitted,Texas Department Of Public Safety,PO BOX 4087,Austin,TX,78773,Director,Steven,,McCraw,Joanne Scarbrough,joanne.scarbrough@dps.texas.gov,James,Earl,Nicholas,,1966-04-24,Male,Anglo or White,48,2014-09-11 17:00:00,2014-09-11 17:00:00,"Yes, results are available",Could not be determined,,Multiple Gunshot Wounds,Not Applicable,Law enforcement/correctional personnel,"Firearm, unspecified",,Not Applicable; cause of death was accidental ...,Firearm,,2030 Jacintoport Blvd.,Houston,Harris,77015.0,,,,Roadway/highway/street/sidewalk,,Police Custody (pre-booking),Custody of Law Enforcement Personnel during/fl...,,NaT,1,Scene of incident,,Yes,Capital Murder - Filed,Aggravated Assault on a Peace Officer - Not Filed,Evading Arrest or Detention - Not Filed,Filed,Violent Crime Against Persons,,"Yes, mark all that apply",Discharged firearm,,Yes (select all that apply),No,No,No,No,No,Yes,Yes,No,Unknown,Yes,No,No,,,V_2016
2,16-4-P,ORIGINAL VERSION,1,2016-12-14 15:27:00,Submitted,TDCJ/Office of the Inspector General,"2503 Lake Road, Suite 5",Huntsville,TX,77340,Other,John,,West,Analou Sievers,analou.sievers@tdcj.texas.gov,Percy,,Froman,,1969-12-19,Male,Anglo or White,46,2000-01-28 00:00:00,2016-11-17 08:25:00,"No, evaluation not planned",Natural,,Colon Cancer,No,Not applicable,Not Applicable,,Pre-existing medical condition,"Not applicable, cause of death was illness/nat...",,8602 Peach Street,Lubbock,Lubbock,79404.0,,,,Law Enforcement Facility,,Penitentiary,"TDCJ, specify",Montford,2000-01-28 00:00:00,0,Medical facility,,No,Aggravated Robbery,,,Convicted,Violent Crime Against Persons,,No,,,No,,,,,,,,,,,,No,,,V_2016
3,16-5-C,ORIGINAL VERSION,1,2016-12-14 18:37:00,Submitted,San Antonio Police Dept.,315 S. Santa Rosa,San Antonio,TX,78207,Chief,William,,McManus,Leroy Carrion,leroy.carrion@sanantonio.gov,Andrew,,Moreno,,1991-12-23,Male,Hispanic or Latino,24,2016-11-19 21:00:00,2016-11-20 05:34:00,"Yes, results are available",Homicide (includes Justifiable Homicide),,Multiple gunshot wounds,No,Law enforcement/correctional personnel,Rifle/shotgun,,Not Applicable; cause of death was accidental ...,Firearm,,5814 Shadow Glen #4,San Antonio,Bexar,78238.0,,,,Residence/Home,,Police Custody (pre-booking),Custody of Law Enforcement Personnel during/fl...,,2016-11-19 21:00:00,0,Scene of incident,,Yes,Aggravated Assault with a Deadly Weapon,,,Filed,Violent Crime Against Persons,,"Yes, mark all that apply",Displayed firearm without discharge,,Yes (select all that apply),Unknown,Yes,No,No,Yes,No,Yes,No,Yes,No,No,No,,,V_2016
4,16-6-MJ,ORIGINAL VERSION,1,2016-12-15 11:20:00,Submitted,Rosenberg Police Dept.,2120 Fourth St.,Rosenberg,TX,77471,Chief,Dallis,,Warren,Charles Crocker,justin.crocker@rosenbergtx.gov,Roberto,Eduardo,Velasquez,,1977-01-31,Male,Hispanic or Latino,39,2016-12-02 22:32:00,2016-12-03 09:15:00,"Yes, results pending",Pending autopsy results,,Pending autopsy results.,No,Not applicable,Not Applicable,,Pre-existing medical condition,"Not applicable, cause of death was illness/nat...",,1910 Louise #41,Rosenberg,Fort Bend,77471.0,,,,Residence/Home,,Municipal Jail,Jail - detox cell,,2016-12-02 22:32:00,0,Medical facility,,No,TCIC warrant - Failure to Appear (Child Neglect),,,Filed,"Other, specify",Investigation of disturbance call at residence...,No,,,No,Yes,No,No,No,No,No,No,No,No,No,No,No,,,V_2016


## Helper methods

In [10]:
def show_notnull_chart(cdr, vertical=False):
    '''Show which columns have null values, how often, and break down by form version.'''
    notnull05 = cdr[cdr.form_version == 'V_2005'].notnull().mean()
    notnull16 = cdr[cdr.form_version == 'V_2016'].notnull().mean()
    frame = pd.concat([notnull05, notnull16], axis=1)
    frame.columns = ['V_2005 forms', 'V_2016 forms']
    frame.sort_index(inplace=True)
    if not vertical:
        frame = frame.T
    return frame.style.background_gradient(cmap='RdYlGn', axis=(0 if vertical else 1))

In [11]:
OTHER_SPECIFY = 'OTHER, SPECIFY'

-----
## Begin Cleaning
-----

### First, take a look at how often our columns are populated, broken down by form version (since each asks slightly different questions)

In [12]:
show_notnull_chart(cdr)

Unnamed: 0,Age At Time Of Death,Agency Address,Agency City,Agency County,Agency Name,Agency Phone,Agency State,Agency Zip,Appear intoxicated (alcohol or drugs),Attempt gain possession officer's weapon,Attempt to Injure Others?,Barricade self or initiate standoff?,CDR: CDR Name,Census Tract,City,County,Custody Date NA,Custody Type Facility,Date of Birth,Date/Time of Custody or Incident,Death Causer Other,Death Date and Time,Death Location,Death Location Elsewhere,Death Reason,Death Weapon Other Description,Decedent Display or Use Weapon Details,Decedent display/use of weapons,Department Type,Director First Name,Director Last Name,Director Middle Name,Director Salutation,Entry Behavior,Entry Date Time,Entry Date Time N/A,Escape or attempt to escape/flee custody,Ethnicity,Ethnicity Other,Exhibit any medical problems?,Exhibit any mental health problems?,First Name,Gain possession of officer's weapon,"Grab, hit or fight with the officer(s)",Injured By,Last Name,Latitude,Location Category,Longitude,Make suicidal statements?,Manner of Death,Manner of Death Description,Means of Death,Means of Death Other,Medical Cause of Death,Medical Examinor/Coroner Evalution?,Medical Treatment,Medical Treatment Description,Middle Name,Offense 1,Offense 2,Offense 3,Other Agencies Respond?,Other Behavior,Other Location Category,"Other device, specify","Other weapon, specify",Physically attempt/assault officer(s),Pre existing medical condition?,Race,Report Date,Reporter Email,Reporter Name,Reporter Name Original CDR,Resist being handcuffed or arrested?,Sex,Specific Type of Custody/Facility,Specify Other Behavior,Specify Weapon Used,Status,Street Address,Suffix,TDCJ - Specify Unit,Threaten the officer(s) involved,Try to escape/flee from custody,Type of Custody,Type of Death Weapon,Type of Offense,"Type of Offense, Other",Type of Restraint,Type of weapon that caused death?,Under Restraint,Use weapon threaten/assault officer(s),Verbally threaten other(s) including law,Version Number,Version Type,Were the Charges:,Who caused the death?,Zip,form_version
V_2005 forms,1,1,1,0.999612,1,1,0.993398,1,0.998447,0.0,0,0.0,1,0,0.998835,1,1,0.512427,1,1,0.00912621,1,1,0.0192233,0.999806,0.0333981,0.0,0,0.999806,1,1,0.202524,1,0.141748,0.693592,1,0.0,1,0.00427184,0.0,0.0,1,0.0,0.998641,0.999806,1,0,0,0,0.0,1,0.589126,1,0.0460194,0.999417,1,0.999806,0.279223,0.559806,0.998835,0.261553,0.0819417,0,1,0.0,0.0159223,0.0,0.0,1,0,1.0,1.0,0,0.999806,0.998641,1,1.0,0.0341748,0.0,1,0.999806,0.0706796,0.0,0.999417,0.999029,1,0.999417,0.00194175,0.149126,0.00038835,0.0,0.999806,1,0.0,1,1,1,1,0.0,1
V_2016 forms,1,1,1,0.0,1,0,1.0,1,0.875336,0.64843,1,0.644843,1,0,0.999103,1,0,0.0,1,1,0.0,1,1,0.0197309,0.0,0.0,0.164126,1,0.0,1,1,0.0125561,1,0.0,0.773991,1,0.643946,0,0.0,0.880717,0.875336,1,0.647534,0.0,0.0,1,0,1,0,0.874439,1,0.0107623,1,0.0286996,0.996413,1,1.0,0.0,0.373094,0.994619,0.266368,0.104036,1,0,0.0403587,0.00717489,0.032287,0.649327,1,1,0.999103,0.999103,1,0.0,0.636771,1,0.955157,0.0,0.0170404,1,1.0,0.0466368,0.66278,0.0,0.0,1,0.0,0.999103,0.119283,0.0547085,0.999103,1.0,0,0.650224,1,1,1,1,0.999103,1


### Start by dropping columns that we either don't need or haven't written code to clean yet.

In [13]:
keep_text = '''Both forms

- Age At Time Of Death
- Agency Address
- Agency City
- Agency Name
- Agency Zip
- CDR: CDR Name
- Census Tract
- City
- County
- Date of Birth
- Date/Time of Custody or Incident
- Death Date and Time
- Death Location
- Death Location Elsewhere
- Entry Date Time
- Entry Date Time N/A
- First Name
- Middle Name
- Last Name
- Latitude
- Longitude
- Suffix
- Manner of Death
- Manner of Death Description
- Means of Death
- Means of Death Other
- Medical Cause of Death
- Medical Examinor/Coroner Evalution?
- Medical Treatment
- Offense 1
- Offense 2
- Offense 3
- Pre existing medical condition?
- Report Date
- Sex
- Specific Type of Custody/Facility
- Street Address
- Type of Custody
- Type of Offense
- Type of Offense, Other
- Version Number
- Version Type
- Were the Charges:
- Who caused the death?
- form_version

2005 form only

- Agency County
- Custody Date NA
- Death Causer Other
- Department Type
- Entry Behavior
- Ethnicity
- Ethnicity Other
- Other Behavior
- Specify Other Behavior

2016 form only

- Exhibit any medical problems?
- Exhibit any mental health problems?
- Make suicidal statements?
- Race'''
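keep_cols = []
drop_cols = set(cdr.columns)  # Whatever is left in this set after the loop is not kept
for line in keep_text.splitlines():
    if line.startswith('- '):
        colname = line[2:]
        keep_cols.append(colname)
        # Raises KeyError if a listed column is absent from the data (e.g. a typo above)
        drop_cols.remove(colname)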

### Rename columns to be more machine friendly (lowercase, snake_case, and remove non-alphanumeric characters)

In [14]:
col_renames = {}
for c in keep_cols:
    new_name = ''.join([ch if ch.isalnum() else ' ' for ch in c.lower()])
    new_name = '_'.join(new_name.strip().split())
    col_renames[c] = new_name
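
As a quick sanity check, the same transformation can be wrapped in a small function and tried on a few of the column names listed above. The function name `to_snake_case` is introduced here purely for illustration; it is not part of the notebook or of `lib.cleaning_tools`.

```python
def to_snake_case(name):
    """Lowercase, turn non-alphanumeric characters into spaces, join words with '_'."""
    spaced = ''.join(ch if ch.isalnum() else ' ' for ch in name.lower())
    # split() with no arguments also discards leading/trailing/repeated spaces
    return '_'.join(spaced.split())


examples = [
    'CDR: CDR Name',
    'Date/Time of Custody or Incident',
    'Type of Offense, Other',
    'Were the Charges:',
]
for original in examples:
    print('%-35s -> %s' % (original, to_snake_case(original)))
```

Punctuation such as `:`, `/`, `,` and `?` disappears entirely, which is why later cells can refer to columns as plain attributes like `cdr.means_of_death`.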

In [15]:
# Select and rename in one step, returning a new frame rather than renaming a slice in place
cdr = cdr[keep_cols].rename(columns=col_renames)

### Uppercase (and strip) string cell contents

In [16]:
upcase_strip_string_cells(cdr)
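
`upcase_strip_string_cells` comes from `lib.cleaning_tools`, and its source is not shown in this notebook. To give a rough idea of what an in-place helper of this kind could do, here is a minimal sketch. The name `upcase_strip_strings_sketch` and its exact behavior are assumptions, not the library's actual implementation.

```python
import pandas as pd


def upcase_strip_strings_sketch(df):
    """Hypothetical stand-in: uppercase and strip every string cell, in place."""
    for col in df.columns:
        is_texty = (pd.api.types.is_object_dtype(df[col])
                    or pd.api.types.is_string_dtype(df[col]))
        if is_texty:
            # Only touch real strings; missing values and other objects pass through.
            df[col] = df[col].apply(
                lambda v: v.strip().upper() if isinstance(v, str) else v)


demo = pd.DataFrame({
    'sex': [' Male', 'female ', None],
    'age_at_time_of_death': [40, 51, 44],
})
upcase_strip_strings_sketch(demo)
print(demo)
```

Normalizing case and whitespace up front is what lets later cells compare against literals such as `'OTHER'` without worrying about variants like `'Other '`.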

## Standardize and merge race columns -- the 2005 form calls it 'ethnicity', the 2016 'race'

#### Have a look at the values first

In [17]:
cdr.race.value_counts()

ANGLO OR WHITE               478
HISPANIC OR LATINO           335
BLACK OR AFRICAN AMERICAN    295
OTHER                          5
ASIAN OR PACIFIC ISLANDER      2
Name: race, dtype: int64

In [18]:
cdr.ethnicity.value_counts()

ANGLO                               2115
AFRICAN-AMERICAN                    1521
HISPANIC                            1433
ASIAN                                 29
OTHER                                 22
AMERICAN INDIAN/ALASKA NATIVE         13
MIDDLE EAST                           12
NATIVE HAWAIIAN/PACIFIC ISLANDER       5
Name: ethnicity, dtype: int64

In [19]:
# When 'Other' is chosen as the ethnicity on the 2005 form, a subsequent
# field asks the reporter to specify. Clearly, though, some of the
# specified values are not truly 'other' ethnicities. See:
cdr.ethnicity_other.value_counts()

WHITE                  7
CAUCASIAN              4
UNKNOWN                4
UNITED KINGDOM         1
CUBAN                  1
SUDANESE BLACK         1
ARABIAN                1
WHITE NON HISPANIC     1
BELIZE, BRAZILIAN      1
ANGLO & MIDDLE EAST    1
Name: ethnicity_other, dtype: int64

In [20]:
# Let's make sure nobody is filling out the "other ethnicity" column when they shouldn't...
cdr[((cdr.ethnicity != 'OTHER') & cdr.ethnicity_other.notnull())][['ethnicity', 'ethnicity_other']]

Unnamed: 0,ethnicity,ethnicity_other


In [21]:
# Good. Let's transfer those specified ethnicity_other values into
# the 'ethnicity' column, so we can merge everything at once.
other_eth = (cdr.ethnicity == 'OTHER')
print('Merging %d "ethnicity_other" values into the main "ethnicity" column' % other_eth.sum())
cdr.loc[other_eth, 'ethnicity'] = cdr.ethnicity_other[other_eth]
cdr.drop('ethnicity_other', axis=1, inplace=True)

Merging 22 "ethnicity_other" values into the main "ethnicity" column


In [22]:
# Make a single 'race' column that has merged, simplified values of race or ethnicity.
race_eth_list = []
for race, eth in zip(cdr.race, cdr.ethnicity):
    # Only one of (race, eth) should be set
    assert pd.isnull(race) or pd.isnull(eth)
    if pd.isnull(race):
        if pd.isnull(eth):
            race_eth_list.append(None)
            continue
        x = eth
    else:
        x = race
    race_eth_list.append(x)
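
The loop above amounts to "take `race` when it is set, otherwise fall back to `ethnicity`". A vectorized equivalent could look like the sketch below. The toy frame and the function name `merge_race_ethnicity` are illustrative assumptions; the check mirrors the loop's assertion that at most one of the two columns is populated per row.

```python
import pandas as pd


def merge_race_ethnicity(frame):
    """Return race where present, else ethnicity; fail if both are set on any row."""
    both_set = frame['race'].notnull() & frame['ethnicity'].notnull()
    assert not both_set.any(), 'race and ethnicity are both set on some rows'
    # combine_first fills the nulls in `race` with the values from `ethnicity`
    return frame['race'].combine_first(frame['ethnicity'])


toy = pd.DataFrame({
    'race': ['ANGLO OR WHITE', None, None],
    'ethnicity': [None, 'HISPANIC', None],
})
merged = merge_race_ethnicity(toy)
print(merged)
```

Rows where both columns are missing stay missing, just as the loop appends `None` for them.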

In [23]:
cdr['race'] = race_eth_list
cdr.drop('ethnicity', axis=1, inplace=True)
cdr.race.value_counts()

ANGLO                               2115
AFRICAN-AMERICAN                    1521
HISPANIC                            1433
ANGLO OR WHITE                       478
HISPANIC OR LATINO                   335
BLACK OR AFRICAN AMERICAN            295
ASIAN                                 29
AMERICAN INDIAN/ALASKA NATIVE         13
MIDDLE EAST                           12
WHITE                                  7
NATIVE HAWAIIAN/PACIFIC ISLANDER       5
OTHER                                  5
CAUCASIAN                              4
UNKNOWN                                4
ASIAN OR PACIFIC ISLANDER              2
BELIZE, BRAZILIAN                      1
WHITE NON HISPANIC                     1
ARABIAN                                1
SUDANESE BLACK                         1
CUBAN                                  1
UNITED KINGDOM                         1
ANGLO & MIDDLE EAST                    1
Name: race, dtype: int64

### Regularize race and several other columns

In [24]:
standardize_gender_cols(cdr)
standardize_race_cols(cdr)
numericalize_age_cols(cdr)

cdr.race.value_counts()

WHITE       2606
BLACK       1817
HISPANIC    1768
OTHER         74
Name: race, dtype: int64
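
`standardize_race_cols` lives in `lib.cleaning_tools` and is not shown here. The general idea of collapsing many labels into a handful of groups can be sketched with an explicit mapping, as below. The `RACE_GROUPS` table and `simplify_race` are hypothetical, cover only the unambiguous labels seen above, and are not guaranteed to reproduce the library's handling of free-text values (and therefore its exact counts).

```python
import pandas as pd

# Hypothetical mapping, limited to labels whose group is unambiguous.
RACE_GROUPS = {
    'ANGLO': 'WHITE',
    'ANGLO OR WHITE': 'WHITE',
    'WHITE': 'WHITE',
    'CAUCASIAN': 'WHITE',
    'AFRICAN-AMERICAN': 'BLACK',
    'BLACK OR AFRICAN AMERICAN': 'BLACK',
    'HISPANIC': 'HISPANIC',
    'HISPANIC OR LATINO': 'HISPANIC',
}


def simplify_race(series):
    """Map known labels to their group; any other non-null label becomes 'OTHER'."""
    mapped = series.map(RACE_GROUPS)
    # Keep mapped values and genuine nulls; replace unmapped labels with 'OTHER'.
    return mapped.where(mapped.notnull() | series.isnull(), 'OTHER')


toy = pd.Series(['ANGLO', 'BLACK OR AFRICAN AMERICAN', 'HISPANIC OR LATINO', 'ASIAN', None])
simplified = simplify_race(toy)
print(simplified.value_counts())
```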

## Standardize agency names, and add agency counties

In [25]:
# Standardize agency name (so we can join/compare across datasets)
cdr['agency_name'] = cdr['agency_name'].apply(standardize_agency_name)

# Lookup county name by agency name. If this fails, fall back
# on the county specified in the form, if it exists.
cdr['agency_county'] = cdr['agency_county'].str.upper()
county_lookup = cdr['agency_name'].apply(lambda name: agency_county.get(name, np.nan))
cdr['agency_county'] = county_lookup.fillna(cdr['agency_county'])

# Manually handle one major agency
cdr.loc[cdr['agency_name'] == 'TEXAS DEPT OF CRIMINAL JUSTICE', 'agency_county'] = 'STATE'
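
The lookup-with-fallback step can be seen in isolation on a toy example. The agency names and the `toy_lookup` dict below are made up for illustration. `Series.map` with a dict yields NaN for keys it does not contain, and `fillna` then fills those gaps from the county reported on the form.

```python
import numpy as np
import pandas as pd

# Made-up lookup table standing in for the agencies_and_counties dataset.
toy_lookup = {'EXAMPLE POLICE DEPT': 'TRAVIS'}

toy = pd.DataFrame({
    'agency_name': ['EXAMPLE POLICE DEPT', 'UNLISTED SHERIFFS OFFICE', 'UNLISTED CONSTABLE'],
    'agency_county': ['HARRIS', 'BEXAR', np.nan],
})

looked_up = toy['agency_name'].map(toy_lookup)                 # NaN where the agency is unknown
toy['agency_county'] = looked_up.fillna(toy['agency_county'])  # fall back to the form's value
print(toy)
```

Note the precedence: when the lookup has an answer it wins over the form's value (the first row becomes `TRAVIS`), and a row stays null only when both sources are empty.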

In [26]:
# Check that we are only missing counties for a paltry few records now.
cdr[cdr['agency_county'].isnull()]['agency_name'].value_counts()

TEXAS JUVENILE JUSTICE DEPT OFFICE OF INSPECTOR GENERAL     2
ELLIS CO SHERIFFS OFFICE WAYNE MCCOLLUM DETENTION CENTER    1
METROPOLITAN TRANSIT AUTH OF HARRIS CO                      1
ECTOR CO SHERIFFS OFC                                       1
Name: agency_name, dtype: int64

## Fix death-information-related columns

#### The two form versions offer slightly different options (e.g. V_2005 uses 'AT MEDICAL FACILITY' while V_2016 uses 'MEDICAL FACILITY'). Collapse values to stabilize the options across form versions, and roll up rare values into a single 'OTHER' value.
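
One way to apply such a mapping is sketched below. The helper name `collapse_values` and the toy data are assumptions for illustration. It strips each string before looking it up and passes missing values through untouched, so it does not raise on null cells.

```python
import pandas as pd


def collapse_values(series, replacements):
    """Replace each string by its canonical form; leave non-strings (e.g. NaN) alone."""
    def _collapse(value):
        if not isinstance(value, str):
            return value
        value = value.strip()
        return replacements.get(value, value)
    return series.apply(_collapse)


toy = pd.Series(['AT MEDICAL FACILITY', 'MEDICAL FACILITY ', 'ELSEWHERE', None])
collapsed = collapse_values(toy, {
    'AT MEDICAL FACILITY': 'MEDICAL FACILITY',
    'ELSEWHERE': 'OTHER, SPECIFY',
})
print(collapsed.value_counts())
```

Cross-tabulating the column against `form_version` before and after the replacement is a cheap way to confirm that each option now appears under a single label for both forms.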

### Fix: `death_location`

In [27]:
pd.crosstab(cdr.death_location, cdr.form_version)

form_version,V_2005,V_2016
death_location,Unnamed: 1_level_1,Unnamed: 2_level_1
AT LAW ENFORCEMENT FACILITY,1269,0
AT MEDICAL FACILITY,3053,0
AT THE CRIME/ARREST SCENE,673,0
DEAD ON ARRIVAL AT MEDICAL FACILITY,0,21
ELSEWHERE,97,0
"ELSEWHERE, SPECIFY",1,19
EN ROUTE TO BOOKING CENTER/POLICE LOCKUP,2,3
EN ROUTE TO MEDICAL FACILITY,54,0
LAW ENFORCEMENT FACILITY/BOOKING CENTER,0,184
MEDICAL FACILITY,0,743


In [28]:
replacements = {
    'AT MEDICAL FACILITY': 'MEDICAL FACILITY',
    'AT LAW ENFORCEMENT FACILITY': 'LAW ENFORCEMENT FACILITY',
    'AT THE CRIME/ARREST SCENE': 'CRIME/ARREST SCENE',
    'SCENE OF INCIDENT': 'CRIME/ARREST SCENE',
    'LAW ENFORCEMENT FACILITY/BOOKING CENTER': 'LAW ENFORCEMENT FACILITY',
    'DEAD ON ARRIVAL AT MEDICAL FACILITY': 'EN ROUTE TO MEDICAL FACILITY',
    'EN ROUTE TO BOOKING CENTER/POLICE LOCKUP': 'EN ROUTE TO LAW ENFORCEMENT FACILITY',
    'ELSEWHERE': OTHER_SPECIFY,
    'ELSEWHERE, SPECIFY': OTHER_SPECIFY,
}
cdr['death_location'] = cdr['death_location'].apply(lambda x: replacements.get(x.strip(), x))
pd.crosstab(cdr.death_location, cdr.form_version)

form_version,V_2005,V_2016
death_location,Unnamed: 1_level_1,Unnamed: 2_level_1
CRIME/ARREST SCENE,674,145
EN ROUTE TO LAW ENFORCEMENT FACILITY,2,3
EN ROUTE TO MEDICAL FACILITY,54,21
LAW ENFORCEMENT FACILITY,1269,184
MEDICAL FACILITY,3053,743
"OTHER, SPECIFY",98,19


### Fix: `means_of_death`

In [29]:
pd.crosstab(cdr.means_of_death, cdr.form_version)

form_version,V_2005,V_2016
means_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1
BATON / BLUNT INSTRUMENT,0,1
BLUNT INSTRUMENT,12,0
DON'T KNOW,108,0
DON\'T KNOW,32,0
DRUG OVERDOSE,151,26
FIREARM,892,181
"HANGING, STRANGULATION",545,80
KNIFE / EDGED INSTRUMENT,0,3
"KNIFE, CUTTING INSTRUMENT",38,0
"NOT APPLICABLE, CAUSE OF DEATH WAS ILLNESS/NATURAL CAUSE",2,711


In [30]:
replacements = {
    'NOT APPLICABLE, CAUSE OF DEATH WAS ILLNESS/NATURAL CAUSE': 'NOT APPLICABLE',
    'NOT APPLICABLE; CAUSE OF DEATH WAS INTOXICATION OR ILLNESS/NATURAL CAUSES': 'NOT APPLICABLE',
    'OTHER': OTHER_SPECIFY,
    'KNIFE, CUTTING INSTRUMENT': 'KNIFE / EDGED INSTRUMENT',
    'BLUNT INSTRUMENT': 'BATON / BLUNT INSTRUMENT',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
}
cdr['means_of_death'] = cdr['means_of_death'].apply(lambda x: replacements.get(x.strip(), x))
pd.crosstab(cdr.means_of_death, cdr.form_version)

form_version,V_2005,V_2016
means_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1
BATON / BLUNT INSTRUMENT,12,1
DRUG OVERDOSE,151,26
FIREARM,892,181
"HANGING, STRANGULATION",545,80
KNIFE / EDGED INSTRUMENT,38,3
NOT APPLICABLE,3135,711
"OTHER, SPECIFY",236,27
UNKNOWN,140,79
VEHICLE ACCIDENT,1,7


### It seems officers sometimes misuse the 'other' option, selecting it when another category is more appropriate:

In [31]:
cdr[cdr.means_of_death == OTHER_SPECIFY]['means_of_death_other'].value_counts().head()

MOTOR VEHICLE ACCIDENT                                         8
BLUNT FORCE TRAUMA RECEIVED DURING A MOTOR VEHICLE ACCIDENT    8
DROWNING                                                       6
VEHICLE                                                        6
FALL                                                           6
Name: means_of_death_other, dtype: int64

### Let's roll all rare and 'unknown'-type values into the 'other' category. 'VEHICLE ACCIDENT' in particular is new in the 2016 form, only indicated as 'other' in older forms.

In [32]:
other_values = ['UNKNOWN', 'VEHICLE ACCIDENT', 'KNIFE / EDGED INSTRUMENT', 'BATON / BLUNT INSTRUMENT']
indices = cdr['means_of_death'].isin(other_values)
cdr.loc[indices, 'means_of_death_other'] = cdr.loc[indices, 'means_of_death']
cdr.loc[indices, 'means_of_death'] = OTHER_SPECIFY

pd.crosstab(cdr.means_of_death, cdr.form_version)

form_version,V_2005,V_2016
means_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1
DRUG OVERDOSE,151,26
FIREARM,892,181
"HANGING, STRANGULATION",545,80
NOT APPLICABLE,3135,711
"OTHER, SPECIFY",427,117

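
The same pattern of moving a value into the free-text column and then marking the row as "other" comes up again for `manner_of_death`. A small helper could capture it, as in the sketch below; `roll_into_other` and the toy data are illustrative assumptions, not code from this repo.

```python
import pandas as pd

OTHER_SPECIFY = 'OTHER, SPECIFY'


def roll_into_other(df, column, other_column, values):
    """Move the listed values into `other_column` and set `column` to OTHER_SPECIFY.

    Modifies `df` in place and returns the number of rows affected. Like the
    cell above, it overwrites whatever free text those rows already had.
    """
    mask = df[column].isin(values)
    df.loc[mask, other_column] = df.loc[mask, column]
    df.loc[mask, column] = OTHER_SPECIFY
    return int(mask.sum())


toy = pd.DataFrame({
    'means_of_death': ['FIREARM', 'UNKNOWN', 'VEHICLE ACCIDENT'],
    'means_of_death_other': [None, None, None],
})
n_moved = roll_into_other(toy, 'means_of_death', 'means_of_death_other',
                          ['UNKNOWN', 'VEHICLE ACCIDENT'])
print(n_moved)
print(toy)
```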

### Fix: `manner_of_death` and `manner_of_death_description`

In [33]:
pd.crosstab(cdr.form_version, cdr.manner_of_death).T

form_version,V_2005,V_2016
manner_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1
ACCIDENTAL,3,33
ACCIDENTAL INJURY CAUSED BY OTHERS,20,0
ACCIDENTAL INJURY TO SELF,157,0
ALCOHOL/DRUG INTOXICATION,291,7
COULD NOT BE DETERMINED,3,7
HOMICIDE (INCLUDES JUSTIFIABLE HOMICIDE),15,142
JUSTIFIABLE HOMICIDE,679,0
NATURAL,18,641
NATURAL CAUSES/ILLNESS,2867,0
OTHER,136,0


In [34]:
pd.crosstab(cdr.manner_of_death, cdr.manner_of_death_description.notnull())

manner_of_death_description,False,True
manner_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1
ACCIDENTAL,33,3
ACCIDENTAL INJURY CAUSED BY OTHERS,20,0
ACCIDENTAL INJURY TO SELF,157,0
ALCOHOL/DRUG INTOXICATION,298,0
COULD NOT BE DETERMINED,9,1
HOMICIDE (INCLUDES JUSTIFIABLE HOMICIDE),156,1
JUSTIFIABLE HOMICIDE,679,0
NATURAL,642,17
NATURAL CAUSES/ILLNESS,1,2866
OTHER,0,136


In [35]:
replacements = {
    'NATURAL': 'NATURAL CAUSES/ILLNESS',
    'JUSTIFIABLE HOMICIDE': 'HOMICIDE',
    'OTHER HOMICIDE': 'HOMICIDE',
    'HOMICIDE (INCLUDES JUSTIFIABLE HOMICIDE)': 'HOMICIDE',
    'ACCIDENTAL INJURY CAUSED BY OTHERS': 'ACCIDENTAL',
    'ACCIDENTAL INJURY TO SELF': 'ACCIDENTAL',
    'OTHER': OTHER_SPECIFY,
    'OTHER - SPECIFY': OTHER_SPECIFY,
}
cdr['manner_of_death'] = cdr['manner_of_death'].apply(lambda x: replacements.get(x.strip(), x))

# In past versions, "pending autopsy results" was not an option, and reports had "other"
# checked with some mention of pending autopsy in the free field. We emulate this here
# to preserve consistency across form versions.
other_values = ['PENDING AUTOPSY RESULTS', 'COULD NOT BE DETERMINED']
indices = cdr['manner_of_death'].isin(other_values)
cdr.loc[indices, 'manner_of_death_description'] = cdr.loc[indices, 'manner_of_death']
cdr.loc[indices, 'manner_of_death'] = OTHER_SPECIFY

pd.crosstab(cdr.form_version, cdr.manner_of_death).T

form_version,V_2005,V_2016
manner_of_death,Unnamed: 1_level_1,Unnamed: 2_level_1
ACCIDENTAL,180,33
ALCOHOL/DRUG INTOXICATION,291,7
HOMICIDE,812,142
NATURAL CAUSES/ILLNESS,2885,641
"OTHER, SPECIFY",141,177
SUICIDE,841,115
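
The two-step recode above (collapse near-identical labels, then fold values that older forms lacked into "OTHER, SPECIFY" while keeping the original label as the description) can be sketched on a toy frame. This is a minimal illustration, not the notebook's own code: the helper name `recode_manner_of_death` and the toy rows are hypothetical, and `OTHER_SPECIFY` is assumed to equal `'OTHER, SPECIFY'`, as the crosstab output suggests.

```python
import pandas as pd

OTHER_SPECIFY = 'OTHER, SPECIFY'  # assumed value, inferred from the crosstab output

def recode_manner_of_death(df, replacements, fold_into_other):
    """Return a recoded copy of df (hypothetical helper, for illustration only)."""
    out = df.copy()
    # Step 1: collapse near-identical labels; nulls pass through untouched.
    out['manner_of_death'] = out['manner_of_death'].apply(
        lambda x: x if pd.isnull(x) else replacements.get(x.strip(), x))
    # Step 2: copy selected labels into the free-text description column,
    # then mark those rows as "OTHER, SPECIFY".
    mask = out['manner_of_death'].isin(fold_into_other)
    out.loc[mask, 'manner_of_death_description'] = out.loc[mask, 'manner_of_death']
    out.loc[mask, 'manner_of_death'] = OTHER_SPECIFY
    return out

toy = pd.DataFrame({
    'manner_of_death': ['NATURAL', 'JUSTIFIABLE HOMICIDE',
                        'COULD NOT BE DETERMINED', 'SUICIDE'],
    'manner_of_death_description': [None, None, None, None],
})
recoded = recode_manner_of_death(
    toy,
    replacements={'NATURAL': 'NATURAL CAUSES/ILLNESS',
                  'JUSTIFIABLE HOMICIDE': 'HOMICIDE'},
    fold_into_other=['PENDING AUTOPSY RESULTS', 'COULD NOT BE DETERMINED'],
)
print(recoded)
```

Note that step 2 overwrites any description already present on the folded rows, just as the cell above does.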


### A few suicides by hanging have the wrong `means_of_death`

In [36]:
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death != 'HANGING, STRANGULATION')]
frame = frame[(frame.medical_cause_of_death.fillna('').str.contains('HANGING')) |
              frame.manner_of_death_description.fillna('').str.contains('HANGING')]
print(len(frame))
frame[['medical_cause_of_death', 'means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description']]

41


Unnamed: 0,medical_cause_of_death,means_of_death,means_of_death_other,manner_of_death,manner_of_death_description
992,PROVISIONAL DEATH IS ASPHYXIATION BY HANGING.,NOT APPLICABLE,,SUICIDE,
1058,SUICIDE BY HANGING,NOT APPLICABLE,,SUICIDE,
1082,ASPHYXIATION DUE TO STRANGULATION BY HANGING,NOT APPLICABLE,,SUICIDE,
1313,ASPHYXIATION DUE TO STRANGULATION BY HANGING,NOT APPLICABLE,,SUICIDE,
1332,HANGING,NOT APPLICABLE,,SUICIDE,
1374,MASSIVE ACUTE ANOXIC BRAIN INJURY WITH THE UND...,NOT APPLICABLE,,SUICIDE,
1479,ASHYXIATION DUE TO HANGING,NOT APPLICABLE,,SUICIDE,
1489,HANGING,NOT APPLICABLE,,SUICIDE,
1543,ASPHYXIATION BY HANGING DUE TO SUSPENDED SELF ...,NOT APPLICABLE,,SUICIDE,
1554,MEDICAL CAUSE OF DEATH WAS HANGING,NOT APPLICABLE,,SUICIDE,


In [37]:
cdr.loc[frame.index, 'means_of_death'] = 'HANGING, STRANGULATION'

# Be sure we got them all
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death != 'HANGING, STRANGULATION')]
frame = frame[(frame.medical_cause_of_death.fillna('').str.contains('HANGING')) |
              frame.manner_of_death_description.fillna('').str.contains('HANGING')]
print(len(frame))

0


### For a few suicides, the `means_of_death` is 'NOT APPLICABLE', which makes no sense. We change these to "OTHER, SPECIFY", though (as shown below) more is going on

In [38]:
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death == 'NOT APPLICABLE')]
print(len(frame))
frame[['means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description', 'medical_cause_of_death']]

19


Unnamed: 0,means_of_death,means_of_death_other,manner_of_death,manner_of_death_description,medical_cause_of_death
825,NOT APPLICABLE,,SUICIDE,,SUDDEN CARDIAC DEATH SECONDARY TO EXSANGUINATI...
1022,NOT APPLICABLE,,SUICIDE,,INCISED WOUNDS OF BOTH WRISTS AND RIGHT ARM
1094,NOT APPLICABLE,,SUICIDE,,MORPHINE AND CHLORPHENIRAMINE OVERDOSE
1292,NOT APPLICABLE,,SUICIDE,,TOXIC EFFECT OF LETHAL LEVELS OF QUETIAPINE
1347,NOT APPLICABLE,,SUICIDE,,NORTRIPTYLINE AND SERTRALINE TOXICITY
1354,NOT APPLICABLE,,SUICIDE,,ACUTE ACETOAMINOPHEN OVERDOSE WITH HEMORRHAGIC...
1477,NOT APPLICABLE,,SUICIDE,,BLUNT FORCE TRAUMA OF HEAD DUE TO JUMP FROM HE...
1488,NOT APPLICABLE,,SUICIDE,,MULTIPLE DRUG INTOXICATION
1608,NOT APPLICABLE,,SUICIDE,,BETA BLOCKER TOXICITY
1619,NOT APPLICABLE,,SUICIDE,,ASPHYXIATION


In [39]:
cdr.loc[frame.index, 'means_of_death'] = OTHER_SPECIFY
len(cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death == 'NOT APPLICABLE')])

0

### Fix: `pre_existing_medical_condition` 

In [40]:
pd.crosstab(cdr.form_version, cdr.pre_existing_medical_condition).T

form_version,V_2005,V_2016
pre_existing_medical_condition,Unnamed: 1_level_1,Unnamed: 2_level_1
COULD NOT BE DETERMINED,0,613
DECEASED DEVELOPED CONDITION AFTER ADMISSION,0,54
DEVELOPED CONDITION AFTER ADMISSION,155,0
DON'T KNOW,1147,0
DON\'T KNOW,836,0
"NOT APPLICABLE; CAUSE OF DEATH WAS ACCIDENTAL INJURY, INTOXICATION, SUICIDE OR HOMICIDE",2072,329
PRE-EXISTING MEDICAL CONDITION,940,119


In [41]:
replacements = {
    'DECEASED DEVELOPED CONDITION AFTER ADMISSION': 'DEVELOPED CONDITION AFTER ADMISSION',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'NOT APPLICABLE; CAUSE OF DEATH WAS ACCIDENTAL INJURY, INTOXICATION, SUICIDE OR HOMICIDE': 'NOT APPLICABLE',
    'COULD NOT BE DETERMINED': 'UNKNOWN',
    'PRE-EXISTING MEDICAL CONDITION': 'YES',
}
cdr['pre_existing_medical_condition'] = cdr['pre_existing_medical_condition'].apply(lambda x: replacements.get(x.strip(), x))
pd.crosstab(cdr.form_version, cdr.pre_existing_medical_condition).T

form_version,V_2005,V_2016
pre_existing_medical_condition,Unnamed: 1_level_1,Unnamed: 2_level_1
DEVELOPED CONDITION AFTER ADMISSION,155,54
NOT APPLICABLE,2072,329
UNKNOWN,1983,613
YES,940,119


### Fix: `who_caused_the_death` 

NOTE: This question is framed as follows:
* 2005 form: "If the death was an accident or homicide, who caused the death?"
* 2016 form: "If the death was an accident, homicide **or suicide**, who caused the death?" (emphasis added)

Thus, we need to:
1. Collapse near-identical values from different forms, similar to the other areas here.
1. Remove suicides from the 2016 responses, as they skew the data (see below). While we're at it, change ANY entries that are not of type homicide/accident to have "NOT APPLICABLE" as the value.
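
On a toy frame, the second step might look like the sketch below. The rows are invented for illustration, and `OTHER_SPECIFY` is assumed to be `'OTHER, SPECIFY'`. The list of manners that keep their answer mirrors the filter used in the cell later in this section, which also retains "OTHER, SPECIFY" rows.

```python
import pandas as pd

OTHER_SPECIFY = 'OTHER, SPECIFY'  # assumed value of the notebook's constant

toy = pd.DataFrame({
    'manner_of_death': ['SUICIDE', 'HOMICIDE', 'ACCIDENTAL', 'NATURAL CAUSES/ILLNESS'],
    'who_caused_the_death': ['DECEDENT', 'OTHER DETAINEE(S)', 'DECEDENT', 'UNKNOWN'],
})

# Only these manners of death keep their "who caused the death" answer;
# every other row is reset so the 2005 and 2016 forms stay comparable.
keep = ['HOMICIDE', 'ACCIDENTAL', OTHER_SPECIFY]
toy.loc[~toy['manner_of_death'].isin(keep), 'who_caused_the_death'] = 'NOT APPLICABLE'
print(toy)
```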

#### Collapse near-identical values from different forms, similar to the other areas here.

In [42]:
pd.crosstab(cdr.who_caused_the_death, cdr.death_date_and_time.dt.year)

death_date_and_time,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
who_caused_the_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
DECEASED,11,14,14,13,12,14,5,13,19,12,13,16,0,0
DECEDENT,0,0,0,0,0,0,0,0,0,0,0,8,112,40
DON'T KNOW,3,4,6,14,23,24,21,11,8,5,0,0,0,0
DON\'T KNOW,0,0,0,0,0,0,0,0,0,2,18,14,0,0
LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,0,0,0,0,0,0,0,0,1,2,2,14,80,49
LAW ENFORCEMENT/CORRECTIONAL STAFF,41,41,56,50,57,51,42,73,72,75,100,72,0,0
NOT APPLICABLE,0,0,0,0,0,0,0,0,0,0,0,60,504,217
"NOT APPLICABLE; CAUSE OF DEATH WAS SUICIDE, INTOXICATION OR ILLNESS/NATURAL CAUSES",232,262,286,254,224,214,228,238,563,521,552,441,0,0
OTHER CIVILIAN(S),0,0,0,0,0,0,0,0,0,0,0,0,1,0
OTHER DETAINEE(S),0,0,0,0,0,0,0,0,0,0,0,2,5,1


In [43]:
replacements = {
    'DECEASED': 'DECEDENT',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'LAW ENFORCEMENT/CORRECTIONAL STAFF': 'LAW ENFORCEMENT/CORRECTIONAL PERSONNEL',
    'NOT APPLICABLE; CAUSE OF DEATH WAS SUICIDE, INTOXICATION OR ILLNESS/NATURAL CAUSES': 'NOT APPLICABLE',
    'OTHER DETAINEES': 'OTHER DETAINEE(S)',
    'OTHER PERSONS': 'OTHER CIVILIAN(S)',
    'ACCIDENTAL INJURY TO SELF': 'ACCIDENTAL',
    'UNKNOWN PERSON(S) CAUSED THE INJURY': 'UNKNOWN',
    'UNKNOWN WHETHER DECEDENT SUSTAINED A FATAL INJURY': 'UNKNOWN',
}
cdr['who_caused_the_death'] = cdr['who_caused_the_death'].apply(lambda x: replacements.get(x.strip(), x))
pd.crosstab(cdr['who_caused_the_death'], cdr.form_version)

form_version,V_2005,V_2016
who_caused_the_death,Unnamed: 1_level_1,Unnamed: 2_level_1
DECEDENT,156,160
LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,731,147
NOT APPLICABLE,4015,781
OTHER CIVILIAN(S),46,1
OTHER DETAINEE(S),49,8
UNKNOWN,153,18


#### Remove suicides from the 2016 responses, as they skew the data (see below)

In [44]:
pd.crosstab(cdr.who_caused_the_death, cdr.manner_of_death)

manner_of_death,ACCIDENTAL,ALCOHOL/DRUG INTOXICATION,HOMICIDE,NATURAL CAUSES/ILLNESS,"OTHER, SPECIFY",SUICIDE
who_caused_the_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
DECEDENT,115,19,4,4,34,140
LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,2,2,839,0,35,0
NOT APPLICABLE,57,253,23,3471,180,812
OTHER CIVILIAN(S),11,1,20,2,12,1
OTHER DETAINEE(S),5,0,49,0,3,0
UNKNOWN,23,23,19,49,54,3


In [45]:
cdr.loc[~cdr.manner_of_death.isin(['HOMICIDE', 'ACCIDENTAL', OTHER_SPECIFY]), 'who_caused_the_death'] = 'NOT APPLICABLE'
pd.crosstab(cdr.who_caused_the_death, cdr.manner_of_death)

manner_of_death,ACCIDENTAL,ALCOHOL/DRUG INTOXICATION,HOMICIDE,NATURAL CAUSES/ILLNESS,"OTHER, SPECIFY",SUICIDE
who_caused_the_death,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
DECEDENT,115,0,4,0,34,0
LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,2,0,839,0,35,0
NOT APPLICABLE,57,298,23,3526,180,956
OTHER CIVILIAN(S),11,0,20,0,12,0
OTHER DETAINEE(S),5,0,49,0,3,0
UNKNOWN,23,0,19,0,54,0


### Fix: `were_the_charges` 

In [46]:
pd.crosstab(cdr['were_the_charges'], cdr.form_version)

form_version,V_2005,V_2016
were_the_charges,Unnamed: 1_level_1,Unnamed: 2_level_1
A PROBATION/PAROLE VIOLATION,0,12
CONVICTED,2852,759
FILED,840,96
NOT FILED AT TIME OF DEATH,1347,248
PROBATION/PAROLE,111,0


In [47]:
replacements = {
    'PROBATION/PAROLE': 'PROBATION/PAROLE VIOLATION',
    'A PROBATION/PAROLE VIOLATION': 'PROBATION/PAROLE VIOLATION',
}
cdr['were_the_charges'] = cdr['were_the_charges'].apply(lambda x: replacements.get(x.strip(), x))
pd.crosstab(cdr['were_the_charges'], cdr.form_version)

form_version,V_2005,V_2016
were_the_charges,Unnamed: 1_level_1,Unnamed: 2_level_1
CONVICTED,2852,759
FILED,840,96
NOT FILED AT TIME OF DEATH,1347,248
PROBATION/PAROLE VIOLATION,111,12


### Fix: `type_of_custody`

In [48]:
pd.crosstab(cdr['type_of_custody'], cdr.form_version)

form_version,V_2005,V_2016
type_of_custody,Unnamed: 1_level_1,Unnamed: 2_level_1
COUNTY JAIL,1041,117
MUNICIPAL JAIL,143,15
PENITENTIARY,2654,741
POLICE CUSTODY (PRE-BOOKING),1296,114
PRE-CUSTODIAL USE OF FORCE,0,111
PRIVATE CORRECTIONAL FACILITY,16,0
PRIVATE FACILITY,0,17


In [49]:
replacements = {
    'PRE-CUSTODIAL USE OF FORCE': 'POLICE CUSTODY (PRE-BOOKING)',
    'PRIVATE CORRECTIONAL FACILITY': 'PRIVATE FACILITY',
    'COUNTY JAIL': 'JAIL',
    'MUNICIPAL JAIL': 'JAIL',
    'PENITENTIARY': 'PRISON',
}
cdr['type_of_custody'] = cdr['type_of_custody'].apply(lambda x: replacements.get(x.strip(), x))
pd.crosstab(cdr['type_of_custody'], cdr.form_version)

form_version,V_2005,V_2016
type_of_custody,Unnamed: 1_level_1,Unnamed: 2_level_1
JAIL,1184,132
POLICE CUSTODY (PRE-BOOKING),1296,225
PRISON,2654,741
PRIVATE FACILITY,16,17


### Fix: `specific_type_of_custody_facility`

In [50]:
pd.crosstab(cdr.specific_type_of_custody_facility, cdr.form_version)

form_version,V_2005,V_2016
specific_type_of_custody_facility,Unnamed: 1_level_1,Unnamed: 2_level_1
CORRECTIONAL/REHABILITATION FACILITY,33,1
CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST,6,132
CUSTODY OF LAW ENFORCEMENT PERSONNEL SUBSEQUENT TO ARREST,2,41
CUSTODY OF PEACE OFFICER DURING/FLEEING ARREST,954,0
CUSTODY OF PEACE OFFICER SUBSEQUENT TO ARREST,246,0
HALFWAY HOUSE/RESTITUTION CENTER,12,0
HOSPITAL/INFIRMARY,307,46
JAIL - DAY ROOM/RECREATION AREA,25,4
JAIL - DETOX CELL,60,5
JAIL - HOLDING CELL,124,16


In [51]:
replacements = {
    'TDCJ, SPECIFY': 'TDCJ',
    'CUSTODY OF PEACE OFFICER DURING/FLEEING ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST',
    'CUSTODY OF PEACE OFFICER SUBSEQUENT TO ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST',
    'CUSTODY OF LAW ENFORCEMENT PERSONNEL SUBSEQUENT TO ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST',
    'TEXAS-JUVENILE JUSTICE DEPARTMENT - FACILITY/DETENTION CENTER, SPECIFY': 'OTHER',
    'TJPC': 'OTHER',
    'TYC': 'OTHER',
    'HALFWAY HOUSE/RESTITUTION CENTER': 'OTHER',
    'CORRECTIONAL/REHABILITATION FACILITY': 'OTHER',
    'NON-LAW ENFORCEMENT DETOX FACILITY': 'OTHER',
}
cdr['specific_type_of_custody_facility'] = cdr['specific_type_of_custody_facility'].apply(
    lambda x: x if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['specific_type_of_custody_facility'], cdr.form_version)

form_version,V_2005,V_2016
specific_type_of_custody_facility,Unnamed: 1_level_1,Unnamed: 2_level_1
CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST,248,41
CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST,960,132
HOSPITAL/INFIRMARY,307,46
JAIL - DAY ROOM/RECREATION AREA,25,4
JAIL - DETOX CELL,60,5
JAIL - HOLDING CELL,124,16
JAIL - MULTIPLE OCCUPANCY CELL,383,33
JAIL - SINGLE CELL,360,46
OTHER,51,3
TDCJ,2632,739


### Glance at the types of custody crossed with facility

In [52]:
pd.crosstab(cdr.specific_type_of_custody_facility, cdr.type_of_custody)

type_of_custody,JAIL,POLICE CUSTODY (PRE-BOOKING),PRISON,PRIVATE FACILITY
specific_type_of_custody_facility,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST,15,274,0,0
CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST,3,1089,0,0
HOSPITAL/INFIRMARY,239,96,3,15
JAIL - DAY ROOM/RECREATION AREA,29,0,0,0
JAIL - DETOX CELL,64,1,0,0
JAIL - HOLDING CELL,135,5,0,0
JAIL - MULTIPLE OCCUPANCY CELL,413,1,0,2
JAIL - SINGLE CELL,399,0,4,3
OTHER,17,3,22,12
TDCJ,2,2,3366,1


#### Identify and drop a range of unnecessary columns

In [53]:
cdr.groupby([cdr.entry_date_time.isnull(), cdr.entry_date_time_n_a]).size().unstack()

entry_date_time_n_a,0,1
entry_date_time,Unnamed: 1_level_1,Unnamed: 2_level_1
False,4435.0,
True,,1830.0


In [54]:
pd.crosstab(cdr.custody_date_na, cdr['date_time_of_custody_or_incident'].isnull())

date_time_of_custody_or_incident,False
custody_date_na,Unnamed: 1_level_1
0.0,5149
1.0,1


In [55]:
cdr.drop(['entry_date_time_n_a', 'custody_date_na'], axis=1, inplace=True)
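
The two checks above ask whether each flag column says anything beyond "the date is missing". One compact way to express that test is sketched below; `flag_matches_null` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd

def flag_matches_null(df, flag_col, value_col):
    """True if flag_col equals 1 on exactly the rows where value_col is null."""
    return bool(((df[flag_col] == 1) == df[value_col].isnull()).all())

toy = pd.DataFrame({
    'entry_date_time': pd.to_datetime(['2017-01-01', None, '2017-03-05']),
    'entry_date_time_n_a': [0, 1, 0],
})
redundant = flag_matches_null(toy, 'entry_date_time_n_a', 'entry_date_time')
print(redundant)
```

A flag that passes this test carries no extra information and can be dropped safely. A flag that fails it, as `custody_date_na` would for its single flagged row, may deserve a closer look before dropping.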

In [56]:
pd.crosstab(cdr.type_of_offense.notnull(), cdr.form_version)

form_version,V_2005,V_2016
type_of_offense,Unnamed: 1_level_1,Unnamed: 2_level_1
False,5140,1
True,10,1114


In [57]:
pd.crosstab(cdr.type_of_offense_other.notnull(), cdr.form_version)

form_version,V_2005,V_2016
type_of_offense_other,Unnamed: 1_level_1,Unnamed: 2_level_1
False,4382,982
True,768,133


In [58]:
cdr.drop(['type_of_offense', 'type_of_offense_other'], axis=1, inplace=True)

In [59]:
pd.crosstab(cdr.department_type.notnull(), cdr.form_version)

form_version,V_2005,V_2016
department_type,Unnamed: 1_level_1,Unnamed: 2_level_1
False,1,1115
True,5149,0


In [60]:
cdr.drop('department_type', axis=1, inplace=True)

In [61]:
cdr['other_behavior'].value_counts()

0.0    4974
1.0     176
Name: other_behavior, dtype: int64

In [62]:
pd.crosstab(cdr['other_behavior'], cdr['specify_other_behavior'].notnull())

specify_other_behavior,False,True
other_behavior,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,4974,0
1.0,0,176


In [63]:
cdr['other_behavior'] = cdr['specify_other_behavior']
cdr.drop('specify_other_behavior', axis=1, inplace=True)

## Add new columns

### Column for the time (in days) between incarceration (or incident) and death
Note: for a few of these, the death date/time precedes the incarceration/incident date. If the gap is no more than one day, we assume a minor data-entry error and record 0. If the gap is more than a day, the error is clearly larger, so we record NaN.

In [64]:
def get_days(dt):
    if dt.days < -1:
        return None
    elif dt.days == -1:
        return 0
    else:
        return dt.days

delta = cdr.death_date_and_time - cdr.date_time_of_custody_or_incident
print("For %d records with death date before custody date, setting the days_from_custody_to_death to NaN" % (delta.dt.days < -1).sum())
cdr['days_from_custody_to_death'] = delta.apply(get_days)

For 5 records with death date before custody date, setting the days_from_custody_to_death to NaN
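
The `-1` case deserves a note: pandas reports a negative gap of up to 24 hours as `days == -1`, so that branch covers deaths recorded up to a day before custody began. A small sketch with invented timestamps (the series here are hypothetical, not taken from the data) shows the three outcomes, using a vectorized variant of `get_days`:

```python
import numpy as np
import pandas as pd

custody = pd.Series(pd.to_datetime(
    ['2017-01-01 09:00', '2017-01-02 01:00', '2017-01-01 00:00']))
death = pd.Series(pd.to_datetime(
    ['2017-01-02 10:00', '2017-01-01 20:00', '2016-12-20 00:00']))

# Whole-day component of each gap; a gap of -5 hours shows up as -1 here.
raw_days = (death - custody).dt.days.astype(float)

# Same rules as get_days: small negative gaps become 0, larger ones NaN.
days = raw_days.copy()
days[raw_days == -1] = 0
days[raw_days < -1] = np.nan
print(pd.DataFrame({'raw_days': raw_days, 'days': days}))
```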


In [65]:
cdr['name_full'] = ''
for col in ['first_name', 'middle_name', 'last_name', 'suffix']:
    cdr['name_full'] = cdr['name_full'] + ' ' + cdr[col].fillna('')
cdr['name_full'] = cdr['name_full'].apply(lambda s: ' '.join(s.strip().split()))
cdr.loc[cdr['name_full'] == '', 'name_full'] = np.nan

## Tweak bookkeeping columns

In [66]:
cdr['num_revisions'] = cdr['version_number'] - 1
cdr.drop(['version_type', 'version_number'], axis=1, inplace=True)

## Rename and reorder columns sensibly

In [67]:
col_renames = {
    'first_name': 'name_first',
    'middle_name': 'name_middle',
    'last_name': 'name_last',
    'suffix': 'name_suffix',
    'cdr_cdr_name': 'record_number',
    'death_causer_other': 'who_caused_death_in_homicide_or_accident_other',
    'who_caused_the_death': 'who_caused_death_in_homicide_or_accident',
    'death_location': 'death_location_type',
    'death_location_elsewhere': 'death_location_type_other',
    'city': 'death_location_city',
    'county': 'death_location_county',
    'latitude': 'death_location_latitude',
    'longitude': 'death_location_longitude',
    'census_tract': 'death_location_census_tract',
    'street_address': 'death_location_street_address',
    'entry_date_time': 'facility_entry_date_time',
    'pre_existing_medical_condition': 'death_from_pre_existing_medical_condition',
}

In [68]:
cdr.rename(col_renames, axis=1, inplace=True)

In [69]:
after = cdr[[
    # Record indexing columns
    'record_number',
    'num_revisions',
    'form_version',
    'report_date',
    'date_time_of_custody_or_incident',

    # Deceased personal information, demographics
    'name_first',
    'name_last',
    'name_middle',
    'name_suffix',
    'name_full',
    'date_of_birth',
    'age_at_time_of_death',
    'sex',
    'race',

    # Death event information
    'death_date_and_time',
    'death_location_county',
    'death_location_city',
    'death_location_street_address',
    'death_location_latitude',
    'death_location_longitude',
    'death_location_census_tract',
    'death_location_type',
    'death_location_type_other',
    'death_from_pre_existing_medical_condition',
    'manner_of_death',
    'manner_of_death_description',
    'means_of_death',
    'means_of_death_other',
    'medical_cause_of_death',
    'medical_examinor_coroner_evalution',
    'medical_treatment',
    'days_from_custody_to_death',
    'who_caused_death_in_homicide_or_accident',
    'who_caused_death_in_homicide_or_accident_other',

    # Criminal information on deceased
    'offense_1',
    'offense_2',
    'offense_3',
    'were_the_charges',

    # Facility and agency information
    'facility_entry_date_time',
    'type_of_custody',
    'specific_type_of_custody_facility',
    'agency_address',
    'agency_city',
    'agency_county',
    'agency_name',
    'agency_zip',
    
    # Deceased behavior upon entry or custody
    'entry_behavior',
    'other_behavior',
    'exhibit_any_medical_problems',
    'exhibit_any_mental_health_problems',
    'make_suicidal_statements',
]]
assert cdr.shape == after.shape
cdr = after

In [70]:
cdr.sample(10)

Unnamed: 0,record_number,num_revisions,form_version,report_date,date_time_of_custody_or_incident,name_first,name_last,name_middle,name_suffix,name_full,date_of_birth,age_at_time_of_death,sex,race,death_date_and_time,death_location_county,death_location_city,death_location_street_address,death_location_latitude,death_location_longitude,death_location_census_tract,death_location_type,death_location_type_other,death_from_pre_existing_medical_condition,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,days_from_custody_to_death,who_caused_death_in_homicide_or_accident,who_caused_death_in_homicide_or_accident_other,offense_1,offense_2,offense_3,were_the_charges,facility_entry_date_time,type_of_custody,specific_type_of_custody_facility,agency_address,agency_city,agency_county,agency_name,agency_zip,entry_behavior,other_behavior,exhibit_any_medical_problems,exhibit_any_mental_health_problems,make_suicidal_statements
126,PA05127CJ,0,V_2005,2005-07-26 15:29:00,2005-05-23 18:52:00,BRANDON,ASUNCION,S,,BRANDON S ASUNCION,1986-04-06,19.0,M,HISPANIC,2005-06-28 18:30:00,BELL,BELTON,111 W. CENTRAL AVE.,,,,MEDICAL FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,STAPHYLOCOCCUS,NOT APPLICABLE,,METHICILLIN-RESISTANT STAPHYLOCOCCUS AUREUS SE...,"YES, RESULTS ARE AVAILABLE",NO,35.0,NOT APPLICABLE,,INJURY TO A CHILD,,,FILED,2005-05-23 18:52:00,JAIL,HOSPITAL/INFIRMARY,P. O. BOX 749,BELTON,BELL,BELL CO SHERIFFS OFFICE,76513,,,,,
4976,PA16421P,1,V_2005,2016-09-26 15:14:00,1996-03-21 00:00:00,VAUGHN,BIRDWELL,,,VAUGHN BIRDWELL,1949-10-18,66.0,M,WHITE,2016-09-20 18:39:00,ANDERSON,TENNESSEE COLONY,2664 FM 2054,,,,LAW ENFORCEMENT FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,COMPLICATIONS FROM DECOMPENSATED AND HEPATOCEL...,NOT APPLICABLE,,COMPLICATIONS FROM DECOMPENSATED AND HEPATOCEL...,"YES, RESULTS ARE AVAILABLE",YES,7488.0,NOT APPLICABLE,,MURDER,,,CONVICTED,1996-03-21 00:00:00,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340,,,,,
940,18-154-P,0,V_2016,2018-03-16 10:07:00,2017-01-12 00:00:00,DANIEL,CAVAZOS,,,DANIEL CAVAZOS,1945-08-13,72.0,M,HISPANIC,2018-03-10 07:18:00,ANDERSON,PALESTINE,1400 FM 3452,,,,MEDICAL FACILITY,,NOT APPLICABLE,NATURAL CAUSES/ILLNESS,,NOT APPLICABLE,,STROKE,"YES, RESULTS PENDING",NOT APPLICABLE,422.0,NOT APPLICABLE,,FAILURE TO COMPLY AS SEX OFFENDER,,,CONVICTED,2017-01-12 00:00:00,PRISON,TDCJ,1012 VETERANS MEMORIAL PARKWAY,HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77320,,,,,
2761,PA13187P,0,V_2005,2013-05-02 10:59:00,2009-10-14 00:00:00,REX,HOWARD,,,REX HOWARD,1955-12-10,57.0,M,WHITE,2013-04-23 17:32:00,GALVESTON,GALVESTON,809 HARBORSIDE DRIVE,,,,MEDICAL FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,CANCER,NOT APPLICABLE,,PANCREATIC CANCER,"YES, RESULTS ARE AVAILABLE",YES,1287.0,NOT APPLICABLE,,ARSON - HABITATION,,,CONVICTED,2009-10-14 00:00:00,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340,,,,,
366,17-280-P,1,V_2016,2017-05-25 13:42:00,2011-12-21 00:00:00,FREDRICK,MUNGER,,,FREDRICK MUNGER,1945-01-01,72.0,M,WHITE,2017-05-22 16:38:00,ANDERSON,TENNESSE COLONY,2664 FM 2054,,,,LAW ENFORCEMENT FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,,NOT APPLICABLE,,SUDDEN CARDIAC DEATH DUE TO UNDERLYING CARDIOV...,"YES, RESULTS ARE AVAILABLE",YES,1979.0,NOT APPLICABLE,,AGGRAVATED SEXUAL ASSAULT,,,CONVICTED,2011-12-21 00:00:00,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340,,,UNKNOWN,UNKNOWN,UNKNOWN
2642,PA13067P,0,V_2005,2013-02-20 14:32:00,2000-02-08 00:00:00,RICHARD,GRAHAM,WAYNE,,RICHARD WAYNE GRAHAM,1935-05-06,77.0,M,WHITE,2013-02-07 16:50:00,GALVESTON,GALVESTON,809 HARBORSIDE DRIVE,,,,MEDICAL FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,MULTI-ORGAN FAILURE,NOT APPLICABLE,,MULTI-ORGAN FAILURE,"NO, EVALUATION NOT PLANNED",YES,4748.0,NOT APPLICABLE,,AGGRAVATED SEXUAL ASSAULT OF A CHILD,,,CONVICTED,2000-02-08 00:00:00,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340,,,,,
5007,PA16475MJ,0,V_2005,2016-10-07 00:00:00,2016-10-03 14:00:00,JOSEPH,GUNTER,ALLEN,,JOSEPH ALLEN GUNTER,1982-11-08,33.0,M,HISPANIC,2016-10-03 20:20:00,FORT BEND,ROSENBERG,2120 4TH STREET,,,,LAW ENFORCEMENT FACILITY,,NOT APPLICABLE,SUICIDE,,"HANGING, STRANGULATION",,PENDING AUTOPSY REPORT,"YES, RESULTS PENDING",NOT APPLICABLE,0.0,NOT APPLICABLE,,POSSESSION OF CONTROLLED SUBSTANCE,,,NOT FILED AT TIME OF DEATH,2016-10-03 14:00:00,JAIL,JAIL - SINGLE CELL,2120 4TH ST.,ROSENBERG,FORT BEND,ROSENBERG POLICE DEPT,77471,,,,,
251,17-165-P,0,V_2016,2017-03-31 16:46:00,2016-04-07 00:00:00,ISAAC,DUARTE,,,ISAAC DUARTE,1940-07-19,76.0,M,HISPANIC,2017-03-22 23:45:00,GALVESTON,GALVESTON,809 HARBORSIDE DRIVE,,,,MEDICAL FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,,NOT APPLICABLE,,SUBARACHNOID HEMORRHAGE WITH CARDIOPULMONARY A...,"NO, EVALUATION NOT PLANNED",UNKNOWN,349.0,NOT APPLICABLE,,MURDER,,,CONVICTED,2016-04-07 00:00:00,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340,,,UNKNOWN,UNKNOWN,UNKNOWN
3264,PA14029P,0,V_2005,2014-01-31 10:07:00,2010-10-06 00:00:00,MICKEY,ROBINETT,,,MICKEY ROBINETT,1949-09-25,64.0,M,WHITE,2014-01-17 09:40:00,ANDERSON,TENNESSEE COLONY,2664 FM 2054,,,,LAW ENFORCEMENT FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,LIVER & BONE CANCER,NOT APPLICABLE,,LIVER & BONE CANCER,"NO, EVALUATION NOT PLANNED",YES,1199.0,NOT APPLICABLE,,AGGRAVATED SEXUAL ASSAULT OF A CHILD TWO COUNTS,SEXUAL PERFORMANCE OF A CHILD THREE COUNTS,,CONVICTED,2010-10-06 00:00:00,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340,,,,,
4333,PA15476C,0,V_2005,2015-09-20 12:33:00,2015-08-27 22:00:00,SILVESTRE,GARCIA,,,SILVESTRE GARCIA,1967-10-20,47.0,M,HISPANIC,2015-08-30 00:51:00,BEXAR,SAN ANTONIO,700 CRESTHILL,,,,MEDICAL FACILITY,,NOT APPLICABLE,HOMICIDE,,DRUG OVERDOSE,,EXCITED DELIRIUM RESULTING FROM THE TOXIC EFFE...,"YES, RESULTS ARE AVAILABLE",NOT APPLICABLE,2.0,NOT APPLICABLE,,ASSAULT - BODILY INJURY,,,NOT FILED AT TIME OF DEATH,NaT,POLICE CUSTODY (PRE-BOOKING),CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FL...,315 SOUTH SANTA ROSA,SAN ANTONIO,BEXAR,SAN ANTONIO POLICE DEPT,78207,,,,,


In [71]:
show_notnull_chart(cdr)

Unnamed: 0,age_at_time_of_death,agency_address,agency_city,agency_county,agency_name,agency_zip,date_of_birth,date_time_of_custody_or_incident,days_from_custody_to_death,death_date_and_time,death_from_pre_existing_medical_condition,death_location_census_tract,death_location_city,death_location_county,death_location_latitude,death_location_longitude,death_location_street_address,death_location_type,death_location_type_other,entry_behavior,exhibit_any_medical_problems,exhibit_any_mental_health_problems,facility_entry_date_time,form_version,make_suicidal_statements,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,name_first,name_full,name_last,name_middle,name_suffix,num_revisions,offense_1,offense_2,offense_3,other_behavior,race,record_number,report_date,sex,specific_type_of_custody_facility,type_of_custody,were_the_charges,who_caused_death_in_homicide_or_accident,who_caused_death_in_homicide_or_accident_other
V_2005 forms,1,1,1,1.0,1,1,1,1,0.999029,1,1,0,0.998835,1,0,0,0.999806,1,0.0192233,0.141748,0.0,0.0,0.693592,1,0.0,1,0.589515,1,0.0831068,0.999417,1,0.999806,1,1,1,0.559806,0.0706796,1,0.998835,0.261553,0.0819417,0.0341748,1,1,1.0,1,1.0,1,1,1,0.00912621
V_2016 forms,1,1,1,0.995516,1,1,1,1,1.0,1,1,0,0.999103,1,0,0,1.0,1,0.0197309,0.0,0.880717,0.875336,0.773991,1,0.874439,1,0.161435,1,0.10852,0.996413,1,1.0,1,1,1,0.373094,0.0466368,1,0.994619,0.266368,0.104036,0.0,1,1,0.999103,1,0.955157,1,1,1,0.0


## Write

In [72]:
with dw.open_remote_file(DTW_PROJECT_KEY_CDR, CLEANED_FILENAME) as w:
    print("Writing to data.world:", CLEANED_FILENAME)
    cdr.to_csv(w, index=False)

Writing to data.world: cleaned_custodial_death_reports.csv


---
## TODOs -- problems that still exist in the data and need work
---

### [TODO] The list of possible offenses is enormous and needs condensing
#### Of ~6,300 custodial deaths, there are ~2,300 unique `offense_1` values, e.g.:

In [85]:
print('Of %d custodial deaths, there are %d unique offense_1 values' % (len(cdr), len(set(cdr.offense_1))))
cdr.offense_1.value_counts().head(20)

Of 6265 custodial deaths, there are 2293 unique offense_1 values


MURDER                                     369
AGGRAVATED SEXUAL ASSAULT OF A CHILD       224
AGGRAVATED ROBBERY                         166
DRIVING WHILE INTOXICATED                  136
AGGRAVATED SEXUAL ASSAULT                  119
AGGRAVATED ASSAULT WITH A DEADLY WEAPON    114
ROBBERY                                    112
INDECENCY WITH A CHILD                     110
AGGRAVATED ASSAULT                         106
CAPITAL MURDER                             102
PUBLIC INTOXICATION                         94
BURGLARY OF A HABITATION                    88
POSSESSION OF A CONTROLLED SUBSTANCE        83
ATTEMPTED CAPITAL MURDER                    72
POSSESSION OF CONTROLLED SUBSTANCE          71
DWI                                         66
MURDER WITH A DEADLY WEAPON                 64
AGGRAVATED ROBBERY WITH A DEADLY WEAPON     60
SEXUAL ASSAULT OF A CHILD                   54
BURGLARY OF HABITATION                      51
Name: offense_1, dtype: int64

#### Clearly some of these could be condensed. See the possibilities for offenses containing the word 'MURDER':

In [74]:
s = cdr.offense_1.dropna()
murder = s[s.str.contains('MURDER')]
N = 20
s = murder.value_counts()[:N]
s['(...%d other types of murder)' % (len(set(murder)) - N)] = murder.value_counts()[N:].sum()
s

MURDER                                          369
CAPITAL MURDER                                  102
ATTEMPTED CAPITAL MURDER                         72
MURDER WITH A DEADLY WEAPON                      64
MURDER WITH DEADLY WEAPON                        18
ATTEMPTED CAPITAL MURDER OF A POLICE OFFICER     17
ATTEMPTED MURDER                                 13
ATTEMPTED CAPITAL MURDER OF A PEACE OFFICER      11
CRIMINAL ATTEMPT CAPITAL MURDER                  11
ATTEMPTED CAPITAL MURDER OF PEACE OFFICER         9
MURDER WITH A DEADLY WPN                          8
ATT. CAPITAL MURDER                               6
ATTEMPTED CAPITAL MURDER OF POLICE OFFICER        6
MURDER WITH DEADLY WPN                            5
ATTEMPT CAPITAL MURDER                            4
MURDER WITH MALICE                                4
MURDER W/DEADLY WPN                               3
ATTEMPTED MURDER-ENHANCED                         3
CAPITAL MURDER WITH DEADLY WEAPON                 3
MURDER F/1  

#### Something like this might extract a useful simplified category (murders and attempted murders)

In [75]:
def is_attempt(x):
    parts = x.split()
    parts = [''.join(ch for ch in p if ch.isalpha()) for p in parts]
    for w in ['ATT', 'ATTEMPT', 'ATTEMPTED']:
        if w in parts:
            return True
    return False

att = murder.apply(is_attempt)
att_murder = murder[att]
comm_murder = murder[~att]
set(att_murder)

{'ATT CAPITAL MURDER',
 'ATT MURDER',
 'ATT MURDER WITH A DEADLY WPN',
 'ATT. CAPITAL MURDER',
 'ATT. CAPITAL MURDER OF PEACE OFFICER',
 'ATTEMPT CAP MURDER PO',
 'ATTEMPT CAPITAL MURDER',
 'ATTEMPT CAPITAL MURDER OF A POLICE OFFICER',
 'ATTEMPT CAPITAL MURDER-POLICE OFFICER',
 'ATTEMPT CAPITAL MURDER/ PEACE OFFICER',
 'ATTEMPT CAPITAL MURDER/POLICE OFFICER',
 'ATTEMPT CAPITOL MURDER WITH A DEADLY WEAPON',
 'ATTEMPT TO COMMIT CAPITAL MURDER',
 'ATTEMPTED CAP. MURDER OF A PEACE OFFICER',
 'ATTEMPTED CAPITAL MURDER',
 'ATTEMPTED CAPITAL MURDER (PEACE OFFICER)',
 'ATTEMPTED CAPITAL MURDER - PEACE OFFICER',
 'ATTEMPTED CAPITAL MURDER - PEACE OFFICER/FIREARM',
 'ATTEMPTED CAPITAL MURDER - POLICE OFFICER',
 'ATTEMPTED CAPITAL MURDER 5 COUNTS',
 'ATTEMPTED CAPITAL MURDER DEADLY WEAPON',
 'ATTEMPTED CAPITAL MURDER OF A CHILD',
 'ATTEMPTED CAPITAL MURDER OF A PEACE OFFICER',
 'ATTEMPTED CAPITAL MURDER OF A POLICE OFFICER',
 'ATTEMPTED CAPITAL MURDER OF A POLICE OFFICER (X 3)',
 'ATTEMPTED CAPIT
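
As a possible first pass at condensing, spelling variants could be normalized before counting. The sketch below is only an illustration: `normalize_offense` and its small abbreviation table are hypothetical, built from variants visible in the output above, and would need extending and checking against the real data.

```python
import re
import pandas as pd

# Hypothetical abbreviation table, based on variants seen in the listing above.
ABBREVIATIONS = {
    'WPN': 'WEAPON',
    'ATT': 'ATTEMPTED',
    'ATTEMPT': 'ATTEMPTED',
    'CAP': 'CAPITAL',
    'CAPITOL': 'CAPITAL',
}
DROP_WORDS = {'A', 'OF', 'THE'}

def normalize_offense(text):
    """Collapse punctuation, abbreviations, and filler words in an offense string."""
    text = re.sub(r'\bW/', 'WITH ', text.upper())
    words = re.sub(r'[^A-Z0-9 ]', ' ', text).split()
    words = [ABBREVIATIONS.get(w, w) for w in words]
    return ' '.join(w for w in words if w not in DROP_WORDS)

offenses = pd.Series([
    'MURDER WITH A DEADLY WEAPON',
    'MURDER WITH DEADLY WPN',
    'MURDER W/DEADLY WPN',
    'ATT. CAPITAL MURDER',
    'ATTEMPTED CAPITAL MURDER',
])
normalized = offenses.apply(normalize_offense)
print(normalized.value_counts())
```

Here five raw spellings collapse into two normalized strings. Aggressive normalization can also merge offenses that are legally distinct, so any table like this should be reviewed by someone who knows the charge categories.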

### [TODO] Manner of death is sometimes 'OTHER' when it should not be
Two columns detail the manner of death: `manner_of_death` and `manner_of_death_description`. Some rows have `manner_of_death` set to `OTHER, SPECIFY` when `ACCIDENTAL` or `HOMICIDE` should have been picked, as evidenced by the information in the `manner_of_death_description` column:

In [76]:
frame = cdr[(cdr.manner_of_death == OTHER_SPECIFY) & (cdr.manner_of_death_description.notnull())]
for exempt in ('PENDING', 'DETERMIN'):
    frame = frame[~frame.manner_of_death_description.str.contains(exempt)]
print(len(frame), len(cdr))
frame[['form_version', 'manner_of_death_description', 'manner_of_death', 'medical_cause_of_death']]

78 6265


Unnamed: 0,form_version,manner_of_death_description,manner_of_death,medical_cause_of_death
392,V_2005,BLUNT FORCE HEAD AND NECK INJURY,"OTHER, SPECIFY",BLUNT FORCE HEAD AND NECK INJURY
787,V_2005,SPONTANEOUS ABORTION,"OTHER, SPECIFY",SPONTANEOUS ABORTION
884,V_2005,ACCIDENT,"OTHER, SPECIFY",TOXIC EFFECTS OF OXYCODONE AND DIPHENHYDRAMINE...
1116,V_2005,"SICKLE CELL DISEASE WITH ACUTE SICKLE CRISIS, ...","OTHER, SPECIFY","SICKLE CELL DISEASE WITH ACUTE SICKLE CRISIS, ..."
1182,V_2005,FETAL DEMISE,"OTHER, SPECIFY",FETAL DEMISE
1257,V_2005,M/V ACCIDENT,"OTHER, SPECIFY",MULTIPLE BLUNT FORCE INJURIES
1269,V_2005,CARDIOVASCULAR DISEASE,"OTHER, SPECIFY",ATHEROSCLEROTIC CARDIOVASCULAR DISEASE
1272,V_2005,VISCERAL CONGESTION WITH HEMORRHAGIC PULMONARY...,"OTHER, SPECIFY",VISCERAL CONGESTION WITH HEMORRHAGIC PULMONARY...
1291,V_2005,MV ACCIDENT,"OTHER, SPECIFY",MULTIPLE BLUNT FORCE INJURIES
1294,V_2005,NORTRIPTYLINE INTOXICATION,"OTHER, SPECIFY",NORTRIPTYLINE INTOXICATION WITH ASPIRATION OF ...
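
One way to triage these rows is to scan the free-text description for keywords and propose a replacement category. The sketch below is a minimal, standard-library illustration, not the notebook's actual fix. The helper name `suggest_manner_of_death` is hypothetical, the labels `ACCIDENT` and `HOMICIDE` are taken from the note above and may not match the exact strings used in `manner_of_death`, and the keyword rules are assumptions that would need checking against all 78 rows.

```python
import re

# Assumed keyword rules; none of these come from the CDR form itself.
_HOMICIDE = re.compile(r'\bHOMICIDE\b')
_ACCIDENT = re.compile(r'\bACCIDENT(?:AL)?\b')
# "CEREBRAL VASCULAR ACCIDENT" is a stroke, not an accidental death.
_STROKE = re.compile(r'VASCULAR\s+ACCIDENT')


def suggest_manner_of_death(description):
    """Return a suggested category for a free-text description, or None."""
    if not isinstance(description, str):
        return None  # missing values (None / NaN) get no suggestion
    text = description.upper()
    if _HOMICIDE.search(text):
        return 'HOMICIDE'
    if _ACCIDENT.search(text) and not _STROKE.search(text):
        return 'ACCIDENT'
    return None
```

If the rules look reasonable, `frame.manner_of_death_description.map(suggest_manner_of_death)` would give a column of suggestions to review by hand before overwriting anything in `cdr`.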


### [TODO] (Related) Extract motor vehicle accidents into a unique type

In [77]:
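# Starter code: rows not already marked ACCIDENTAL whose cause of death or
# manner-of-death description mentions 'ACCIDENT'. The substring match also
# catches strokes ("CEREBRAL VASCULAR ACCIDENT"), so review the results by hand.
frame = cdr[cdr.manner_of_death != 'ACCIDENTAL']
mentions_accident = (
    frame.medical_cause_of_death.fillna('').str.contains('ACCIDENT')
    | frame.manner_of_death_description.fillna('').str.contains('ACCIDENT')
)
frame = frame[mentions_accident]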
print(len(frame))
frame[['medical_cause_of_death', 'means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description']]

37


Unnamed: 0,medical_cause_of_death,means_of_death,means_of_death_other,manner_of_death,manner_of_death_description
80,MASSIVE LEFT CEREBRAL VASCULAR ACCIDENT (STROKE),NOT APPLICABLE,,NATURAL CAUSES/ILLNESS,STROKE
884,TOXIC EFFECTS OF OXYCODONE AND DIPHENHYDRAMINE...,NOT APPLICABLE,,"OTHER, SPECIFY",ACCIDENT
1174,TOXIC EFFECTS OF COCAINE (ACCIDENT),NOT APPLICABLE,,ALCOHOL/DRUG INTOXICATION,
1257,MULTIPLE BLUNT FORCE INJURIES,"OTHER, SPECIFY",M/V ACCIDENT,"OTHER, SPECIFY",M/V ACCIDENT
1291,MULTIPLE BLUNT FORCE INJURIES,"OTHER, SPECIFY",MV ACCIDENT,"OTHER, SPECIFY",MV ACCIDENT
1294,NORTRIPTYLINE INTOXICATION WITH ASPIRATION OF ...,NOT APPLICABLE,,"OTHER, SPECIFY",NORTRIPTYLINE INTOXICATION
1456,ANOXIC ENCEPHALOPATHY AND MULTIPLE ORGAN FAILU...,"OTHER, SPECIFY",UNKNOWN,"OTHER, SPECIFY",ACCIDENT
1491,"MYCARDIAL INFARCTION INCLUDING HYPERTENSION, H...",NOT APPLICABLE,,NATURAL CAUSES/ILLNESS,CARDIAC ARREST
1688,RIGHT INTRA CEREBRAL HEMORRAHAGE DUE TO HYPERT...,NOT APPLICABLE,,"OTHER, SPECIFY",ACCIDENTAL
1699,COMBINED ETHANOL AND COCAINE TOXICITY - ACCIDE...,NOT APPLICABLE,,ALCOHOL/DRUG INTOXICATION,
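
To separate motor vehicle accidents from the other matches (overdoses, strokes), a narrower pattern could be checked across several free-text fields at once. This is only a sketch under stated assumptions: the helper name `looks_like_motor_vehicle_accident` is hypothetical, and only the spellings `M/V ACCIDENT` and `MV ACCIDENT` are confirmed by the output above. The other variants in the pattern are guesses.

```python
import re

# 'M/V ACCIDENT' and 'MV ACCIDENT' appear in the data; 'MVA', 'MOTOR VEHICLE',
# 'CRASH' and 'COLLISION' are assumed variants that may or may not occur.
_MVA_PATTERN = re.compile(
    r'\bMVA\b|\b(?:M/?V|MOTOR\s+VEHICLE)\s+(?:ACCIDENT|CRASH|COLLISION)\b'
)


def looks_like_motor_vehicle_accident(*fields):
    """Return True if any text field mentions a motor vehicle accident."""
    for field in fields:
        # Skip missing values (None / NaN) and anything that is not text.
        if isinstance(field, str) and _MVA_PATTERN.search(field.upper()):
            return True
    return False
```

Applied row by row, for example over `medical_cause_of_death`, `means_of_death_other` and `manner_of_death_description` with `cdr.apply(..., axis=1)`, this would yield a boolean mask that could seed a dedicated motor-vehicle category.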


### [TODO] Fix up `medical_treatment`

In [78]:
pd.crosstab(cdr.medical_treatment, cdr.form_version).sort_values(['V_2005'],ascending=False)

medical_treatment,V_2005,V_2016
NOT APPLICABLE,2901,210
YES,1440,237
NO,808,167
UNKNOWN,0,501


### [TODO] Remove `medical_examinor_coroner_evalution`? Not sure it's useful.

In [79]:
pd.crosstab(cdr.medical_examinor_coroner_evalution, cdr.form_version).sort_values(['V_2005'],ascending=False)

medical_examinor_coroner_evalution,V_2005,V_2016
"YES, RESULTS ARE AVAILABLE",4253,336
"NO, EVALUATION NOT PLANNED",702,353
"YES, RESULTS PENDING",158,426
"NO, EVALUATION PENDING",37,0


### [TODO] Somehow blend the `entry_behavior`/`other_behavior` field from V_2005 with the `exhibit_any_medical_problems`, `exhibit_any_mental_health_problems` and `make_suicidal_statements` from V_2016

In [80]:
pd.crosstab(cdr.entry_behavior, cdr.form_version).sort_values(['V_2005'],ascending=False)

entry_behavior,V_2005
MEDICAL,350
INTOXICATED,193
MENTAL,83
MENTAL; MEDICAL,42
INTOXICATED; MEDICAL,29
INTOXICATED; MENTAL,19
INTOXICATED; MENTAL; MEDICAL,9
MEDICAL; NA,4
INTOXICATED; NA,1


In [81]:
pd.crosstab(cdr.exhibit_any_medical_problems, cdr.form_version).sort_values(['V_2016'],ascending=False)

exhibit_any_medical_problems,V_2016
UNKNOWN,435
NO,294
YES,253


In [82]:
pd.crosstab(cdr.exhibit_any_mental_health_problems, cdr.form_version).sort_values(['V_2016'],ascending=False)

exhibit_any_mental_health_problems,V_2016
UNKNOWN,558
NO,320
YES,98


In [83]:
pd.crosstab(cdr.make_suicidal_statements, cdr.form_version).sort_values(['V_2016'],ascending=False)

make_suicidal_statements,V_2016
UNKNOWN,518
NO,381
YES,76
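
A possible starting point for the blend is to split the semicolon-separated `entry_behavior` values from V_2005 into per-topic flags that use the same `YES`/`NO`/`UNKNOWN` vocabulary as the V_2016 columns. The sketch below rests on several assumptions that should be checked against the forms: the helper name and the token-to-column mapping are hypothetical, an absent token is read as `NO`, and a missing `entry_behavior` is read as `UNKNOWN`.

```python
# Hypothetical mapping from V_2005 entry_behavior tokens to V_2016 column names.
TOKEN_TO_COLUMN = {
    'MEDICAL': 'exhibit_any_medical_problems',
    'MENTAL': 'exhibit_any_mental_health_problems',
}


def entry_behavior_to_flags(value):
    """Convert e.g. 'INTOXICATED; MENTAL' into V_2016-style YES/NO/UNKNOWN flags."""
    if not isinstance(value, str):
        # Assumption: a missing entry_behavior means the answer is unknown.
        return {column: 'UNKNOWN' for column in TOKEN_TO_COLUMN.values()}
    tokens = {token.strip().upper() for token in value.split(';')}
    # Assumption: a token that is not listed means 'NO' for that topic.
    return {
        column: ('YES' if token in tokens else 'NO')
        for token, column in TOKEN_TO_COLUMN.items()
    }
```

`INTOXICATED` is left out of the mapping because its apparent V_2016 counterpart, `Appear intoxicated (alcohol or drugs)`, is currently among the dropped columns. Likewise, `make_suicidal_statements` has no obvious equivalent among the `entry_behavior` tokens shown above.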


### [TODO] Align various binary behavioral columns from each form version, add to final output

In [84]:
# These columns were dropped entirely. Some could be kept once each is matched
# with its counterpart in the other form version.
drop_cols

{'Agency Phone',
 'Agency State',
 'Appear intoxicated (alcohol or drugs)',
 "Attempt gain possession officer's weapon",
 'Attempt to Injure Others?',
 'Barricade self or initiate standoff?',
 'Custody Type Facility',
 'Death Reason',
 'Death Weapon Other Description',
 'Decedent Display or Use Weapon Details',
 'Decedent display/use of weapons',
 'Director First Name',
 'Director Last Name',
 'Director Middle Name',
 'Director Salutation',
 'Escape or attempt to escape/flee custody',
 "Gain possession of officer's weapon",
 'Grab, hit or fight with the officer(s)',
 'Injured By',
 'Location Category',
 'Medical Treatment Description',
 'Other Agencies Respond?',
 'Other Location Category',
 'Other device, specify',
 'Other weapon, specify',
 'Physically attempt/assault officer(s)',
 'Reporter Email',
 'Reporter Name',
 'Reporter Name Original CDR',
 'Resist being handcuffed or arrested?',
 'Specify Weapon Used',
 'Status',
 'TDCJ - Specify Unit',
 'Threaten the officer(s) involved',
 