# Clean and reformat CDR data from its multi-tab Excel file into a single CSV

### About the data

CDR data is tricky: the form used by law enforcement has changed over time, first in 2005 and again in 2016. Data from before 2005 is known to be sparse, and reporting was poorly enforced, so we ignore those entries. The 2005 and 2016 versions of the form overlap in some fields and differ in others, so we must be careful in how we merge them.

Further, some older deaths that happened in an inpatient setting are not reported in this dataset. These records were requested separately for the first version of the TJI website. We have an "old master file" from that first website that includes these deaths, along with many others. We need to load this file and add any records that we don't yet have.
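
One way to picture "add any records that we don't yet have" is an anti-join: keep only the old-master rows whose report identifier is absent from the current data. The sketch below is illustrative only. It assumes that `cdr_cdr_name` uniquely identifies a report, which may not hold in practice; this notebook instead appends the old master file and deduplicates in step 4. The identifiers, values, and the helper name `rows_missing_from` are made up for the example.

```python
import pandas as pd

def rows_missing_from(current, candidate, key):
    """Return rows of `candidate` whose `key` value does not appear in `current`."""
    return candidate[~candidate[key].isin(current[key])]

# Toy frames standing in for the current CDR data and the old master file
current = pd.DataFrame({'cdr_cdr_name': ['A-1', 'A-2'],
                        'type_of_custody': ['County Jail', 'Penitentiary']})
old = pd.DataFrame({'cdr_cdr_name': ['A-2', 'B-9'],
                    'type_of_custody': ['Penitentiary', 'Penitentiary']})

new_rows = rows_missing_from(current, old, key='cdr_cdr_name')
combined = pd.concat([current, new_rows], ignore_index=True)
print(combined)
```

An exact-key match like this is cheap but brittle: the same death filed under two different identifiers would slip through, which is one reason a later, fuzzier deduplication pass is still useful.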

**In this repo you can find blank versions of the [2005](https://github.com/texas-justice-initiative/data-processing/blob/master/forms/CDR%20Form%20Version%202005.pdf) and [2016](https://github.com/texas-justice-initiative/data-processing/blob/master/forms/CDR%20Form%20Version%202016.pdf) forms, to see for yourself exactly what fields are collected and how.**

### Datasets used


* Input:
  * `tji/raw-and-processing/CDR - All Reports.xlsx`
  * `tji/raw-and-processing/reformatted_cdr_2017_master_file`
  * `tji/auxiliary-datasets/agencies_and_counties`
* Output:
  * `tji/deaths-in-custody/cleaned_custodial_death_reports.csv`
  
##### Author: Everett Wetchler (everett.wetchler@gmail.com)

## STEPS

**1. Setup and read data**
- 1a. Configuration and imports
    - Constants
    - Libraries
- 1b. Read data
    - Drop records with death date pre-2005
- 1c. Helper methods/code
    

**2. Join disparate datasets, rename and reformat columns**
- 2a. Reduce columns used and rename
    - Drop columns that we either don't need or haven't written code to clean yet
    - Rename columns to be more machine friendly
- 2b. Add rows from the old master file to get BJS inpatient deaths 
    - Merge the "old master" file with the other CDR records
- 2c. Column type conversions
    - Convert date columns
    - Convert float columns from string
    - Upcase string columns
    - Add `death_date` column
    
**3. Begin Deep Cleaning**
- 3a. Standardize `race` values
- 3b. Standardize agency information
- 3c. Fix death-information related columns
    - Fix `Death Location`
    - Fix `Means of Death`
    - Fix `manner_of_death` and `manner of death`
    - Fix `pre_existing_medical_condition`
    - Fix `who_caused_the_death`
    - Fix `medical_examinor_coroner_evalution`
- 3d. Fix other columns one-by-one
    - Standardize `gender`
    - Fix `were_there_charges`
    - Fix `type_of_custody`
    - Fix `specific_type_of_custody_facility`
- 3e. Identify and drop a range of unnecessary columns
    - `entry_date_time_n_a` and `custody_date_na`
    - `department_type`

**4. Deduplication**
- 4a. Whole row duplicates
- 4b. Define some dedup functions
- 4c. Dedup
- 4d. Inspect results

**5. Add new columns, rename/reorder, otherwise finalize dataframe**
- 5a. New columns and bookkeeping columns
    - Column for the time (in days) between incarceration (or incident) and death
- 5b. Rename and reorder columns sensibly
- 5c. Final look before writing


**6. Write**


**7. TODOs -- problems that still exist in the data and need work**

---
Let's begin...

<div style="background-color: #080;height: 20px"></div>

# 1. Setup and read data

<div style="background-color: #008;height: 20px"></div>

## 1a. Configuration and imports

In [15]:
# Constants so we don't have to hunt through the code to tweak these in many places

DW_PROJECT_CDR = 'tji/deaths-in-custody'
DW_PROJECT_OIS = 'tji/officer-involved-shootings'
DW_PROJECT_AUXILIARY_DATASETS = 'tji/auxiliary-datasets'
DW_PROJECT_RAW_AND_PROCESSING = 'tji/raw-and-processing'

OUTPUT_DW_PROJECT = DW_PROJECT_CDR
OUTPUT_FILENAME = 'cleaned_custodial_death_reports.csv'

In [16]:
# Import ALL the things

import os
import sys
import json
import boto3
import datetime
import numpy as np
import pandas as pd
import datadotworld as dw
import pygsheets
import getpass

from io import StringIO
from lib.cleaning_tools import *

sys.path.append(os.getcwd() + '/../data_cleaning')

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

%load_ext watermark
%watermark -a "Everett Wetchler, Aiden Yang" -d -t -z -w -p numpy,pandas,datadotworld,pygsheets

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
Everett Wetchler, Aiden Yang 2020-12-31 17:21:12 MST 

numpy 1.17.2
pandas 1.1.2
datadotworld 1.7.0
pygsheets 2.0.3.1
watermark 2.0.2


In [17]:
# Custom libraries specific to this project
from lib.standardize_police_agency_names import standardize_agency_name

<div style="background-color: #008;height: 20px"></div>

## 1b. Read data and take a glance

In [18]:
datasets = dw.load_dataset(DW_PROJECT_AUXILIARY_DATASETS, force_update=True)
agency_county = datasets.dataframes['agencies_and_counties']
agency_county = agency_county.set_index('agency')['county'].to_dict()

In [19]:
if getpass.getuser() == 'hlukas':
    gc = pygsheets.authorize('/Users/hlukas/Downloads/client_secret_lghhager.json')
else:
    gc = pygsheets.authorize(service_file='/home/ec2-user/data-processing/automation/client_secret.json')

In [20]:
dir(gc)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'create',
 'drive',
 'get_range',
 'logger',
 'oauth',
 'open',
 'open_all',
 'open_as_json',
 'open_by_key',
 'open_by_url',
 'sheet',
 'spreadsheet_cls',
 'spreadsheet_ids',
 'spreadsheet_titles',
 'teamDriveId']

In [21]:
if getpass.getuser() == 'hlukas':
    gc.drive.enable_team_drive('0ACeQWapAwOLqUk9PVA')


    titles = gc.spreadsheet_titles()
    ids = gc.spreadsheet_ids()

    cdr_id = [ids[index] for index in range(len(titles)) if titles[index] == 'CDR Reports All'][0]
    output = gc.open_by_key(cdr_id)

else:
    #gc.enableTeamDriveSupport = True
    #gc.teamDriveId = '0ACeQWapAwOLqUk9PVA'
    output = gc.open('OIS')

dfs = {}

# drop empty columns or the concatenation does not work
for sheet_name in ['Form Version 2005', 'Form Version 2016', 'Older Forms']:
    sheet = [i for i in range(4) if output.worksheets()[i].title == sheet_name][0]
    wk = output[sheet]
    try:
        dfs[sheet_name] = wk.get_as_df(include_tailing_empty = True).drop(columns = [''], axis= 1)
    except KeyError:
        dfs[sheet_name] = wk.get_as_df(include_tailing_empty = True)
    
dfs['Form Version 2005']['form_version'] = 'V_2005'
dfs['Form Version 2016']['form_version'] = 'V_2016'
dfs['Older Forms']['form_version'] = 'V_OLDER'

cdr = pd.concat([dfs['Form Version 2005'], dfs['Form Version 2016']])
# Sum over the DataFrames (dfs.values()); iterating over dfs itself yields the sheet names
print('Read %d raw CDR records. Ignoring %d from older form versions (pre-2005).' % (
    sum(len(x) for x in dfs.values()), len(dfs['Older Forms'])))
print('Keeping %d records using form version 2005, and %d using version 2016 (keeping %d in total)' % (
    len(dfs['Form Version 2005']), len(dfs['Form Version 2016']), len(cdr)))

Read 15388 raw CDR records. Ignoring 5468 from older form versions (pre-2005).
Keeping 5156 records using form version 2005, and 4764 using version 2016 (keeping 9920 in total)
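
The `pd.concat` call above aligns the two sheets on column name, so a field that exists on only one form version is filled with `NaN` for rows from the other. A minimal sketch with toy frames (the values are invented; the column names `Ethnicity` and `Race` are taken from the 2005 and 2016 sheets respectively):

```python
import pandas as pd

v2005 = pd.DataFrame({'Sex': ['Male'], 'Ethnicity': ['Hispanic'],
                      'form_version': ['V_2005']})
v2016 = pd.DataFrame({'Sex': ['Female'], 'Race': ['Anglo or White'],
                      'form_version': ['V_2016']})

# Columns are unioned; cells with no source value become NaN
stacked = pd.concat([v2005, v2016], ignore_index=True, sort=False)
print(stacked)
```

Passing `ignore_index=True` gives every row a unique index label. The cell above keeps each sheet's original index, so labels repeat across form versions; that is harmless for the steps shown here but worth remembering before any label-based lookup.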


In [22]:
# datasets = dw.load_dataset(DW_PROJECT_AUXILIARY_DATASETS, force_update=True)
# agency_county = datasets.dataframes['agencies_and_counties']
# agency_county = agency_county.set_index('agency')['county'].to_dict()

In [23]:
# dfs = read_dtw_excel(DW_PROJECT_RAW_AND_PROCESSING, 'original/CDR Reports All.xlsx')
# dfs['Form Version 2005']['form_version'] = 'V_2005'
# dfs['Form Version 2016']['form_version'] = 'V_2016'
# dfs['Older Forms']['form_version'] = 'V_OLDER'
# cdr = pd.concat([dfs['Form Version 2005'], dfs['Form Version 2016']])
# print('Read %d raw CDR records. Ignoring %d from older form versions (pre-2005).' % (
#     sum(len(x) for x in dfs), len(dfs['Older Forms'])))
# print('Keeping %d records using form version 2005, and %d using version 2016 (keeping %d in total)' % (
#     len(dfs['Form Version 2005']), len(dfs['Form Version 2016']), len(cdr)))

### In theory, all these records should be for deaths in 2005 or later. Let's double check and drop any miscreants.

In [24]:
assert cdr['Death Date and Time'].isnull().sum() == 0

In [25]:
cdr['Death Date and Time'] = pd.to_datetime(cdr['Death Date and Time'])
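
Without a `format`, `pd.to_datetime` infers how to read strings such as `1/1/05 15:04`. If every value follows the month/day/two-digit-year pattern seen in the sample rows (an assumption about the sheet, not something verified here), an explicit format makes the parse stricter, and `errors='coerce'` turns anything that does not match into `NaT` so it can be inspected. A sketch on toy values:

```python
import pandas as pd

raw = pd.Series(['1/1/05 15:04', '11/22/16 9:26', 'not a date'])

# %m/%d/%y %H:%M = month/day/two-digit year, 24-hour time
parsed = pd.to_datetime(raw, format='%m/%d/%y %H:%M', errors='coerce')

print(parsed)
print('Unparseable rows:', parsed.isna().sum())
```

Counting the `NaT` values afterwards is a cheap way to confirm that the assumed format really covers the whole column.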


In [26]:
before = len(cdr)
cdr = cdr[cdr['Death Date and Time'].dt.year >= 2005]
after = len(cdr)
print('Dropped %d (of %d) reports for deaths before 2005, leaving %d' % (before - after, before, after))

Dropped 6 (of 9920) reports for deaths before 2005, leaving 9914


In [27]:
print("Data goes from %s to %s" % (cdr['Death Date and Time'].min().strftime("%Y-%m-%d"),
                                   cdr['Death Date and Time'].max().strftime("%Y-%m-%d")))

Data goes from 2005-01-01 to 2020-12-09


### A quick look at the raw data

In [28]:
dfs['Form Version 2005'].head()

Unnamed: 0,CDR: CDR Name,Report Date,Status,Version Type,Version Number,Department Type,Agency Name,Agency Phone,Agency Address,Agency City,Agency County,Agency State,Agency Zip,Director Salutation,Director First Name,Director Middle Name,Director Last Name,Reporter Name Original CDR,Reporter Email,Street Address,City,County,Type of Custody,Specific Type of Custody/Facility,Custody Type Facility,Entry Date Time,Entry Date Time N/A,Death Location,Death Location Elsewhere,First Name,Middle Name,Last Name,Suffix,Date of Birth,Sex,Ethnicity,Ethnicity Other,Death Date and Time,Age At Time Of Death,Date/Time of Custody or Incident,Custody Date NA,Medical Examinor/Coroner Evalution?,Manner of Death,Manner of Death Description,Death Reason,Medical Cause of Death,Medical Treatment,Medical Treatment Description,Who caused the death?,Death Causer Other,Type of Death Weapon,Death Weapon Other Description,Pre existing medical condition?,Means of Death,Means of Death Other,Offense 1,Offense 2,Offense 3,Were the Charges:,Type of Offense,"Type of Offense, Other",Injured By,Threaten the officer(s) involved,Resist being handcuffed or arrested?,Try to escape/flee from custody,"Grab, hit or fight with the officer(s)",Other Behavior,Specify Other Behavior,Appear intoxicated (alcohol or drugs),Use weapon threaten/assault officer(s),Entry Behavior,Under Restraint,Type of Restraint,"Other device, specify",form_version
0,PA05001C,3/2/05 14:48,Submitted,ORIGINAL VERSION,1,POLICE,Fort Worth Police Dept.,8178778022,350 W. Belknap,Fort Worth,TARRANT,TX,76102,Chief,Ralph,,Mendoza,Renee Gray,renee.gray@oag.state.tx.us,1509 W. Hammond,Fort Worth,Tarrant,Police Custody (pre-booking),Custody of Peace Officer during/fleeing arrest,,,1,At the crime/arrest scene,,Dino,,Gomez,,4/3/1964,Male,Hispanic,,1/1/05 15:04,40,1/1/05 15:04,0,"Yes, results are available",Justifiable Homicide,,Injuries only,Gunshot Wound to the Chest,Not Applicable,,Law enforcement/correctional staff,,Handgun,,Not Applicable; cause of death was accidental ...,Firearm,,Aggravated Assault,,,Not filed at time of death,,,Injured by Officer,Yes,Yes,Yes,No,0,,No,1,,No,,,V_2005
1,PA05002CJ,3/3/05 14:07,Submitted,ORIGINAL VERSION,1,SHERIFF,Travis County Sheriff's Dept.,5128549770,P. O. Box 1748,Austin,TRAVIS,TX,78767,Sheriff,Margo,L.,Frasier,Renee Gray,renee.gray@oag.state.tx.us,3614 Bill Price Road,Del Valle,Travis,County Jail,Jail - multiple occupancy cell,,10/20/04 20:56,0,At medical facility,,Michael,Darnell,Dickson,,3/4/1953,Male,African-American,,1/1/05 15:59,51,10/20/04 20:56,0,"Yes, results are available",Natural Causes/Illness,Cardiac Arrest,Medical condition only (e.g. heart attack),Myocardial infarction,Yes,"Treatment for hypertension, administered Proca...","Not applicable; cause of death was suicide, in...",,Not Applicable,,Pre-existing medical condition,Not applicable; cause of death was intoxicatio...,,Man/Del/Sell/Poss Controlled Substance,,,Filed,,,Injured by NA,No,No,No,No,0,,No,0,Medical,No,,,V_2005
2,PA05003P,3/3/05 14:16,Submitted,ORIGINAL VERSION,1,STAGENCY,Texas Department Of Criminal Justice,9364376716,P.O. Box 4003,Huntsville,TRAVIS,TX,773424003,Mr.,Chris,H.,Stallings,Renee Gray,renee.gray@oag.state.tx.us,21 FM 247,Huntsville,Walker,Penitentiary,TDCJ,Byrd,12/17/04 9:00,0,At medical facility,,Darryl,Glenn,Wallace,,12/3/1960,Male,African-American,,1/2/05 4:35,44,12/17/04 9:00,0,"Yes, results are available",Natural Causes/Illness,Cardiac,Medical condition only (e.g. heart attack),Cardiac death secondary to cardiac ischemia th...,Yes,Nitroglycerin,"Not applicable; cause of death was suicide, in...",,Not Applicable,,Pre-existing medical condition,Not applicable; cause of death was intoxicatio...,,Burglary of Habitation,,,Convicted,,,Injured by NA,No,No,No,No,0,,No,0,,No,,,V_2005
3,PA05004C,3/3/05 14:40,Submitted,ORIGINAL VERSION,1,POLICE,San Antonio Police Dept.,2102077449,"214 W. Nueva, Suite 331",San Antonio,BEXAR,TX,78207,Chief,Albert,,Ortiz,Renee Gray,renee.gray@oag.state.tx.us,2102 Goliad Road,San Antonio,Bexar,Police Custody (pre-booking),Custody of Peace Officer during/fleeing arrest,,,1,At the crime/arrest scene,,Albert,Chavez,Enriquez,,11/28/1983,Male,Hispanic,,1/3/05 15:00,21,1/3/05 14:50,0,"Yes, results are available",Justifiable Homicide,,Injuries only,Multiple Gunshot Wounds,Not Applicable,,Law enforcement/correctional staff,,Handgun,,Not Applicable; cause of death was accidental ...,Firearm,,Aggravated Robbery,,,Not filed at time of death,,,Injured by Officer,Yes,Yes,Yes,Yes,0,,No,1,,No,,,V_2005
4,PA05005C,3/3/05 14:51,Submitted,ORIGINAL VERSION,1,POLICE,Houston Police Dept.,7133081778,"1200 Travis, 17th Floor",Houston,HARRIS,TX,77002,Chief,Clarence,O.,Bradford,Renee Gray,renee.gray@oag.state.tx.us,3600 block of Telephone Rd,Houston,Harris,Police Custody (pre-booking),Custody of Peace Officer during/fleeing arrest,,,1,At the crime/arrest scene,,Alex,,Mendez,,12/30/1978,Male,Hispanic,,1/3/05 17:15,26,1/3/05 17:15,0,"Yes, results are available",Justifiable Homicide,,Injuries only,Multiple Gunshot Wounds,Not Applicable,,Law enforcement/correctional staff,,Handgun,,Not Applicable; cause of death was accidental ...,Firearm,,Traffic Violation,,,Not filed at time of death,,Traffic Violation,Injured by Officer,Yes,No,No,No,0,,No,1,,No,,,V_2005


In [29]:
dfs['Form Version 2016'].head()

Unnamed: 0,CDR: CDR Name,Version Type,Version Number,Report Date,Report Time,Status,Agency Name,Agency Address,Agency City,Agency State,Agency Zip,Director Salutation,Director First Name,Director Middle Name,Director Last Name,Reporter Name,Reporter Email,First Name,Middle Name,Last Name,Suffix,Date of Birth,Sex,Race,Age At Time Of Death,Date/Time of Custody or Incident,Death Date and Time,Medical Examinor/Coroner Evalution?,Manner of Death,Manner of Death Description,Medical Cause of Death,Medical Treatment,Who caused the death?,Type of weapon that caused death?,"Other weapon, specify",Pre existing medical condition?,Means of Death,Means of Death Other,Street Address,City,County,Zip,Location Category,Other Location Category,Type of Custody,Specific Type of Custody/Facility,TDCJ - Specify Unit,Entry Date Time,Entry Date Time N/A,Death Location,Death Location Elsewhere,Other Agencies Respond?,Offense 1,Offense 2,Offense 3,Were the Charges:,Type of Offense,"Type of Offense, Other",Decedent display/use of weapons,Decedent Display or Use Weapon Details,Specify Weapon Used,Attempt to Injure Others?,Appear intoxicated (alcohol or drugs),Make suicidal statements?,Exhibit any mental health problems?,Exhibit any medical problems?,Barricade self or initiate standoff?,Resist being handcuffed or arrested?,Physically attempt/assault officer(s),Gain possession of officer's weapon,Verbally threaten other(s) including law,Escape or attempt to escape/flee custody,Attempt gain possession officer's weapon,Under Restraint,Type of Restraint,"Other device, specify",form_version
0,16-3-C,AMENDED,4,12/12/16 13:02,,Submitted,Texas Department Of Public Safety,PO BOX 4087,Austin,TX,78773,Director,Steven,,McCraw,Joanne Scarbrough,joanne.scarbrough@dps.texas.gov,Ivory,Charles,Pantallion,III,9/14/1980,Male,Black or African American,36,11/22/16 9:26,11/22/16 9:26,"Yes, results are available",Homicide (includes Justifiable Homicide),,Multiple Gunshot Wounds,Not Applicable,Law enforcement/correctional personnel,Handgun; Rifle/shotgun,,Not Applicable; cause of death was accidental ...,Firearm,,7300 Interstate 10 W,Baytown,Harris,77521,Roadway/highway/street/sidewalk,,Police Custody (pre-booking),Custody of Law Enforcement Personnel during/fl...,,,1,Scene of incident,,Yes,Aggravated Assault on Peace Officer,Evading Arrest or Detention,,Not filed at time of death,Violent Crime Against Persons,,"Yes, mark all that apply",Displayed firearm without discharge,,Yes (select all that apply),No,No,No,No,Unknown,Yes,Yes,No,Unknown,Yes,No,No,,,V_2016
1,14-1-C,AMENDED,2,12/13/16 16:20,,Submitted,Texas Department Of Public Safety,PO BOX 4087,Austin,TX,78773,Director,Steven,,McCraw,Joanne Scarbrough,joanne.scarbrough@dps.texas.gov,James,Earl,Nicholas,,4/24/1966,Male,Anglo or White,48,9/11/14 17:00,9/11/14 17:00,"Yes, results are available",Could not be determined,,Multiple Gunshot Wounds,Not Applicable,Law enforcement/correctional personnel,"Firearm, unspecified",,Not Applicable; cause of death was accidental ...,Firearm,,2030 Jacintoport Blvd.,Houston,Harris,77015,Roadway/highway/street/sidewalk,,Police Custody (pre-booking),Custody of Law Enforcement Personnel during/fl...,,,1,Scene of incident,,Yes,Capital Murder - Filed,Aggravated Assault on a Peace Officer - Not Filed,Evading Arrest or Detention - Not Filed,Filed,Violent Crime Against Persons,,"Yes, mark all that apply",Discharged firearm,,Yes (select all that apply),No,No,No,No,No,Yes,Yes,No,Unknown,Yes,No,No,,,V_2016
2,16-4-P,ORIGINAL VERSION,1,12/14/16 15:27,,Submitted,TDCJ/Office of the Inspector General,"2503 Lake Road, Suite 5",Huntsville,TX,77340,Other,John,,West,Analou Sievers,analou.sievers@tdcj.texas.gov,Percy,,Froman,,12/19/1969,Male,Anglo or White,46,1/28/00 0:00,11/17/16 8:25,"No, evaluation not planned",Natural,,Colon Cancer,No,Not applicable,Not Applicable,,Pre-existing medical condition,"Not applicable, cause of death was illness/nat...",,8602 Peach Street,Lubbock,Lubbock,79404,Law Enforcement Facility,,Penitentiary,"TDCJ, specify",Montford,1/28/00 0:00,0,Medical facility,,No,Aggravated Robbery,,,Convicted,Violent Crime Against Persons,,No,,,No,,,,,,,,,,,,No,,,V_2016
3,16-5-C,ORIGINAL VERSION,1,12/14/16 18:37,,Submitted,San Antonio Police Dept.,315 S. Santa Rosa,San Antonio,TX,78207,Chief,William,,McManus,Leroy Carrion,leroy.carrion@sanantonio.gov,Andrew,,Moreno,,12/23/1991,Male,Hispanic or Latino,24,11/19/16 21:00,11/20/16 5:34,"Yes, results are available",Homicide (includes Justifiable Homicide),,Multiple gunshot wounds,No,Law enforcement/correctional personnel,Rifle/shotgun,,Not Applicable; cause of death was accidental ...,Firearm,,5814 Shadow Glen #4,San Antonio,Bexar,78238,Residence/Home,,Police Custody (pre-booking),Custody of Law Enforcement Personnel during/fl...,,11/19/16 21:00,0,Scene of incident,,Yes,Aggravated Assault with a Deadly Weapon,,,Filed,Violent Crime Against Persons,,"Yes, mark all that apply",Displayed firearm without discharge,,Yes (select all that apply),Unknown,Yes,No,No,Yes,No,Yes,No,Yes,No,No,No,,,V_2016
4,16-6-MJ,ORIGINAL VERSION,1,12/15/16 11:20,,Submitted,Rosenberg Police Dept.,2120 Fourth St.,Rosenberg,TX,77471,Chief,Dallis,,Warren,Charles Crocker,justin.crocker@rosenbergtx.gov,Roberto,Eduardo,Velasquez,,1/31/1977,Male,Hispanic or Latino,39,12/2/16 22:32,12/3/16 9:15,"Yes, results pending",Pending autopsy results,,Pending autopsy results.,No,Not applicable,Not Applicable,,Pre-existing medical condition,"Not applicable, cause of death was illness/nat...",,1910 Louise #41,Rosenberg,Fort Bend,77471,Residence/Home,,Municipal Jail,Jail - detox cell,,12/2/16 22:32,0,Medical facility,,No,TCIC warrant - Failure to Appear (Child Neglect),,,Filed,"Other, specify",Investigation of disturbance call at residence...,No,,,No,Yes,No,No,No,No,No,No,No,No,No,No,No,,,V_2016


<div style="background-color: #008;height: 20px"></div>

## 1c. Helper methods/code

In [30]:
def show_notnull_chart(cdr, vertical=False):
    '''Show which columns have null values, how often, and break down by form version.'''
    versions = sorted(set(cdr.form_version))
    frames = []
    for v in versions:
        fr = cdr[cdr.form_version == v]
        s = fr.notnull().mean()
        frames.append(s)
    frame = pd.concat(frames, axis=1)
    frame.columns = versions
    frame.sort_index(inplace=True)
    if not vertical:
        frame = frame.T
    return frame.style.background_gradient(cmap='RdYlGn', axis=(0 if vertical else 1))

In [31]:
OTHER_SPECIFY = 'OTHER, SPECIFY'

<div style="background-color: #080;height: 20px"></div>

# 2. Join disparate datasets, rename and reformat columns

<div style="background-color: #008;height: 15px"></div>

## 2a. Reduce the number of columns that we keep, and rename them more conveniently

#### Look at how often our columns are populated, broken down by form version (since each asks slightly different questions)

In [32]:
show_notnull_chart(cdr)

Unnamed: 0,Age At Time Of Death,Agency Address,Agency City,Agency County,Agency Name,Agency Phone,Agency State,Agency Zip,Appear intoxicated (alcohol or drugs),Attempt gain possession officer's weapon,Attempt to Injure Others?,Barricade self or initiate standoff?,CDR: CDR Name,City,County,Custody Date NA,Custody Type Facility,Date of Birth,Date/Time of Custody or Incident,Death Causer Other,Death Date and Time,Death Location,Death Location Elsewhere,Death Reason,Death Weapon Other Description,Decedent Display or Use Weapon Details,Decedent display/use of weapons,Department Type,Director First Name,Director Last Name,Director Middle Name,Director Salutation,Entry Behavior,Entry Date Time,Entry Date Time N/A,Escape or attempt to escape/flee custody,Ethnicity,Ethnicity Other,Exhibit any medical problems?,Exhibit any mental health problems?,First Name,Gain possession of officer's weapon,"Grab, hit or fight with the officer(s)",Injured By,Last Name,Location Category,Make suicidal statements?,Manner of Death,Manner of Death Description,Means of Death,Means of Death Other,Medical Cause of Death,Medical Examinor/Coroner Evalution?,Medical Treatment,Medical Treatment Description,Middle Name,Offense 1,Offense 2,Offense 3,Other Agencies Respond?,Other Behavior,Other Location Category,"Other device, specify","Other weapon, specify",Physically attempt/assault officer(s),Pre existing medical condition?,Race,Report Date,Report Time,Reporter Email,Reporter Name,Reporter Name Original CDR,Resist being handcuffed or arrested?,Sex,Specific Type of Custody/Facility,Specify Other Behavior,Specify Weapon Used,Status,Street Address,Suffix,TDCJ - Specify Unit,Threaten the officer(s) involved,Try to escape/flee from custody,Type of Custody,Type of Death Weapon,Type of Offense,"Type of Offense, Other",Type of Restraint,Type of weapon that caused death?,Under Restraint,Use weapon threaten/assault officer(s),Verbally threaten other(s) including law,Version Number,Version Type,Were the Charges:,Who caused the death?,Zip,form_version
V_2005,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0
V_2016,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
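
Every value in this chart is exactly 0.0 or 1.0, even for fields such as `Middle Name` that are visibly blank in the sample rows. That suggests blank cells were read as empty strings rather than as missing values, and `notnull()` counts `''` as populated. If that is what happened (an assumption; it depends on how the sheet was loaded), converting empty strings to `NaN` first gives a more informative fill rate. A sketch on toy data, using a hypothetical helper `fill_rate_by_version`:

```python
import numpy as np
import pandas as pd

def fill_rate_by_version(df, version_col='form_version'):
    """Share of non-missing values per column and form version, treating '' as missing."""
    cleaned = df.replace('', np.nan)
    populated = cleaned.drop(columns=[version_col]).notnull()
    return populated.groupby(cleaned[version_col]).mean()

# Invented values; only the column names come from the sheets
toy = pd.DataFrame({
    'form_version': ['V_2005', 'V_2005', 'V_2016', 'V_2016'],
    'Middle Name': ['', 'Lee', '', ''],
    'Race': [np.nan, np.nan, 'Anglo or White', ''],
})

rates = fill_rate_by_version(toy)
print(rates)
```

The result has one row per form version, matching the orientation of the chart above, and could be styled with the same background gradient.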


### Drop columns that we either don't need or haven't written code to clean yet.

In [33]:
keep_text = '''Both forms

- Age At Time Of Death
- Agency Address
- Agency City
- Agency Name
- Agency Zip
- CDR: CDR Name
- City
- County
- Date of Birth
- Date/Time of Custody or Incident
- Death Date and Time
- Death Location
- Death Location Elsewhere
- Entry Date Time
- Entry Date Time N/A
- First Name
- Middle Name
- Last Name
- Suffix
- Manner of Death
- Manner of Death Description
- Means of Death
- Means of Death Other
- Medical Cause of Death
- Medical Examinor/Coroner Evalution?
- Medical Treatment
- Offense 1
- Offense 2
- Offense 3
- Pre existing medical condition?
- Report Date
- Sex
- Specific Type of Custody/Facility
- Street Address
- Type of Custody
- Type of Offense
- Type of Offense, Other
- Version Number
- Version Type
- Were the Charges:
- Who caused the death?
- form_version
- Type of Restraint
- Under Restraint

2005 form only

- Agency County
- Custody Date NA
- Death Causer Other
- Department Type
- Entry Behavior
- Ethnicity
- Ethnicity Other
- Other Behavior
- Specify Other Behavior

2016 form only

- Exhibit any medical problems?
- Exhibit any mental health problems?
- Make suicidal statements?
- Race'''
keep_cols = []
drop_cols = set(cdr.columns)
for line in keep_text.splitlines():
    if line.startswith('- '):
        colname = line[2:]
        keep_cols.append(colname)
        drop_cols.remove(colname)

### Rename columns to be more machine friendly (lowercase, snake_case, and remove non-alphanumeric characters)

In [34]:
col_renames = {}
for c in keep_cols:
    new_name = ''.join([ch if ch.isalnum() else ' ' for ch in c.lower()])
    new_name = '_'.join(new_name.strip().split())
    col_renames[c] = new_name
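
To make the transformation concrete, here is the same logic wrapped in a small function (the name `to_snake_case` is ours, not part of the notebook) and applied to a few of the raw headers. Runs of punctuation and spaces collapse into a single underscore, and misspellings in the source headers carry through unchanged.

```python
def to_snake_case(column_name):
    """Lowercase, turn non-alphanumerics into spaces, then join the words with underscores."""
    spaced = ''.join(ch if ch.isalnum() else ' ' for ch in column_name.lower())
    return '_'.join(spaced.split())

examples = ['CDR: CDR Name', 'Entry Date Time N/A',
            'Medical Examinor/Coroner Evalution?', 'Type of Offense, Other']
for raw in examples:
    print('%-40s -> %s' % (raw, to_snake_case(raw)))
```

Because distinct headers could in principle collapse to the same name, it may be worth checking that the renamed columns are still unique before relying on them.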

In [35]:
# Select and rename in one step; rename returns a new frame, so nothing is modified in place on a slice
cdr = cdr[list(col_renames.keys())].rename(columns=col_renames)
cdr.head()

Unnamed: 0,age_at_time_of_death,agency_address,agency_city,agency_name,agency_zip,cdr_cdr_name,city,county,date_of_birth,date_time_of_custody_or_incident,death_date_and_time,death_location,death_location_elsewhere,entry_date_time,entry_date_time_n_a,first_name,middle_name,last_name,suffix,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,offense_1,offense_2,offense_3,pre_existing_medical_condition,report_date,sex,specific_type_of_custody_facility,street_address,type_of_custody,type_of_offense,type_of_offense_other,version_number,version_type,were_the_charges,who_caused_the_death,form_version,type_of_restraint,under_restraint,agency_county,custody_date_na,death_causer_other,department_type,entry_behavior,ethnicity,ethnicity_other,other_behavior,specify_other_behavior,exhibit_any_medical_problems,exhibit_any_mental_health_problems,make_suicidal_statements,race
0,40,350 W. Belknap,Fort Worth,Fort Worth Police Dept.,76102,PA05001C,Fort Worth,Tarrant,4/3/1964,1/1/05 15:04,2005-01-01 15:04:00,At the crime/arrest scene,,,1,Dino,,Gomez,,Justifiable Homicide,,Firearm,,Gunshot Wound to the Chest,"Yes, results are available",Not Applicable,Aggravated Assault,,,Not Applicable; cause of death was accidental ...,3/2/05 14:48,Male,Custody of Peace Officer during/fleeing arrest,1509 W. Hammond,Police Custody (pre-booking),,,1,ORIGINAL VERSION,Not filed at time of death,Law enforcement/correctional staff,V_2005,,No,TARRANT,0.0,,POLICE,,Hispanic,,0.0,,,,,
1,51,P. O. Box 1748,Austin,Travis County Sheriff's Dept.,78767,PA05002CJ,Del Valle,Travis,3/4/1953,10/20/04 20:56,2005-01-01 15:59:00,At medical facility,,10/20/04 20:56,0,Michael,Darnell,Dickson,,Natural Causes/Illness,Cardiac Arrest,Not applicable; cause of death was intoxicatio...,,Myocardial infarction,"Yes, results are available",Yes,Man/Del/Sell/Poss Controlled Substance,,,Pre-existing medical condition,3/3/05 14:07,Male,Jail - multiple occupancy cell,3614 Bill Price Road,County Jail,,,1,ORIGINAL VERSION,Filed,"Not applicable; cause of death was suicide, in...",V_2005,,No,TRAVIS,0.0,,SHERIFF,Medical,African-American,,0.0,,,,,
2,44,P.O. Box 4003,Huntsville,Texas Department Of Criminal Justice,773424003,PA05003P,Huntsville,Walker,12/3/1960,12/17/04 9:00,2005-01-02 04:35:00,At medical facility,,12/17/04 9:00,0,Darryl,Glenn,Wallace,,Natural Causes/Illness,Cardiac,Not applicable; cause of death was intoxicatio...,,Cardiac death secondary to cardiac ischemia th...,"Yes, results are available",Yes,Burglary of Habitation,,,Pre-existing medical condition,3/3/05 14:16,Male,TDCJ,21 FM 247,Penitentiary,,,1,ORIGINAL VERSION,Convicted,"Not applicable; cause of death was suicide, in...",V_2005,,No,TRAVIS,0.0,,STAGENCY,,African-American,,0.0,,,,,
3,21,"214 W. Nueva, Suite 331",San Antonio,San Antonio Police Dept.,78207,PA05004C,San Antonio,Bexar,11/28/1983,1/3/05 14:50,2005-01-03 15:00:00,At the crime/arrest scene,,,1,Albert,Chavez,Enriquez,,Justifiable Homicide,,Firearm,,Multiple Gunshot Wounds,"Yes, results are available",Not Applicable,Aggravated Robbery,,,Not Applicable; cause of death was accidental ...,3/3/05 14:40,Male,Custody of Peace Officer during/fleeing arrest,2102 Goliad Road,Police Custody (pre-booking),,,1,ORIGINAL VERSION,Not filed at time of death,Law enforcement/correctional staff,V_2005,,No,BEXAR,0.0,,POLICE,,Hispanic,,0.0,,,,,
4,26,"1200 Travis, 17th Floor",Houston,Houston Police Dept.,77002,PA05005C,Houston,Harris,12/30/1978,1/3/05 17:15,2005-01-03 17:15:00,At the crime/arrest scene,,,1,Alex,,Mendez,,Justifiable Homicide,,Firearm,,Multiple Gunshot Wounds,"Yes, results are available",Not Applicable,Traffic Violation,,,Not Applicable; cause of death was accidental ...,3/3/05 14:51,Male,Custody of Peace Officer during/fleeing arrest,3600 block of Telephone Rd,Police Custody (pre-booking),,Traffic Violation,1,ORIGINAL VERSION,Not filed at time of death,Law enforcement/correctional staff,V_2005,,No,HARRIS,0.0,,POLICE,,Hispanic,,0.0,,,,,


<div style="background-color: #008;height: 15px"></div>

## 2b. Add rows from the old master file to get BJS inpatient deaths (will dedup later)

In [36]:
# gc.drive.enable_team_drive('0ACeQWapAwOLqUk9PVA')
# titles = gc.spreadsheet_titles()
# ids = gc.spreadsheet_ids()

# cdr_id = [ids[index] for index in range(len(titles)) if titles[index] == 'reformatted_cdr_2017_master_file'][0]
# output = gc.open_by_key(cdr_id)

# wk = output[0]

# dfs = {}

# # drop empty columns or the concatenation does not work
# for sheet_name in ['Form Version 2005', 'Form Version 2016', 'Older Forms']:
#     sheet = [i for i in range(4) if output.worksheets()[i].title == sheet_name][0]
#     wk = output[sheet]
#     try:
#         dfs[sheet_name] = wk.get_as_df(include_tailing_empty = True).drop(columns = [''], axis= 1)
#     except KeyError:
#         dfs[sheet_name] = wk.get_as_df(include_tailing_empty = True)

In [37]:
datasets = dw.load_dataset(DW_PROJECT_RAW_AND_PROCESSING, force_update=True)
old_master = datasets.dataframes['reformatted_cdr_2017_master_file']
old_master['form_version'] = 'V_BJS'
print(old_master.shape)
old_master.head()

(7735, 47)


Unnamed: 0,first_name,middle_name,last_name,suffix,age_at_time_of_death,sex,date_of_birth,offense_1,offense_2,offense_3,report_date,death_date_and_time,cdr_cdr_name,agency_name,agency_address,agency_city,agency_zip,agency_county,who_caused_the_death,death_causer_other,date_time_of_custody_or_incident,custody_date_na,type_of_custody,specific_type_of_custody_facility,street_address,city,county,death_location,death_location_elsewhere,means_of_death,means_of_death_other,manner_of_death,manner_of_death_description,entry_date_time,entry_date_time_n_a,medical_cause_of_death,pre_existing_medical_condition,medical_examinor_coroner_evalution,medical_treatment,ethnicity,ethnicity_other,other_behavior,entry_behavior,were_the_charges,type_of_offense,type_of_offense_other,form_version
0,Clyde,L,Pady,,80.0,Male,2024-10-01,SEXUAL ASSLT/RAPE OF CHILD,SEXUAL ASSLT/RAPE OF CHILD,SEXUAL ASSLT/RAPE OF CHILD,3/1/2005 0:00,2005-01-02,PRISON20051001,Texas Department Of Criminal Justice,,,,,,,5/30/1995 0:00,,Penitentiary,TDCJ,,Tenessee Colony,,,,,,Natural Causes/Illness,Prostate/Lung Cancer,,,,Developed condition after admission,No,Yes,Anglo or White,,,,Convicted,child,,V_BJS
1,Winston,,Wilkins,,61.0,Male,1943-09-27,"STOLEN VEHICLE, UNAUTHORIZED USE",0,0,3/1/2005 0:00,2005-01-02,PRISON20051002,Texas Department Of Criminal Justice,,,,,,,9/8/2003 0:00,,Penitentiary,TDCJ,,Galveston,,,,,,Natural Causes/Illness,Lung Cancer,,,,Pre-existing medical condition,No,Yes,Anglo or White,,,,Convicted,property,,V_BJS
2,Steven,,Fleming,,37.0,Male,1967-10-30,SEXUAL ASSLT/RAPE AGG/SEX ABUSE,0,0,3/1/2005 0:00,2005-01-04,PRISON20051004,Texas Department Of Criminal Justice,,,,,,,10/7/1988 0:00,,Penitentiary,TDCJ,,Livingston,,,,,,Natural Causes/Illness,,,,,Pre-existing medical condition,No,Yes,Black or African American,,,,Convicted,violent,,V_BJS
3,James,B,Walker,,61.0,Male,1943-08-27,AGG SEXUAL ASSLT,AGG KIDNAPPING,0,3/1/2005 0:00,2005-01-04,PRISON20051005,Texas Department Of Criminal Justice,,,,,,,4/23/2004 0:00,,Penitentiary,TDCJ,,Tenessee Colony,,,,,,Natural Causes/Illness,Hepatocellular Cancer,,,,Developed condition after admission,No,Yes,Black or African American,,,,Convicted,violent,,V_BJS
4,David,L,Jones,,45.0,Male,1960-05-01,INDECENCY W/ A CHILD SEXUAL CONTACT,0,0,3/1/2005 0:00,2005-01-06,PRISON20051007,Texas Department Of Criminal Justice,,,,,,,11/8/2001 0:00,,Penitentiary,TDCJ,,Abilene,,,,,,Natural Causes/Illness,Cardiomyopathy Heart Failure,,,,Pre-existing medical condition,No,Yes,Anglo or White,,,,Convicted,child,,V_BJS


In [38]:
cdr.shape, old_master.shape

((9914, 57), (7735, 47))

#### There will be a few columns that the old master file doesn't have, which is fine

In [39]:
set(cdr.columns) - set(old_master.columns)

{'department_type',
 'exhibit_any_medical_problems',
 'exhibit_any_mental_health_problems',
 'make_suicidal_statements',
 'race',
 'specify_other_behavior',
 'type_of_restraint',
 'under_restraint',
 'version_number',
 'version_type'}

#### However, it shouldn't have any novel columns that the *other* data doesn't have

In [40]:
assert(len(set(old_master.columns) - set(cdr.columns)) == 0)
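
The two checks above can be tried on toy data. This is a minimal sketch: the frames `main` and `extra` are hypothetical stand-ins for `cdr` and `old_master`. It also shows that `pd.concat` fills columns missing from one frame with NaN, which is why the one-way column difference is acceptable.

```python
import pandas as pd

# Hypothetical stand-ins for `cdr` and `old_master`
main = pd.DataFrame({"first_name": ["A"], "race": ["WHITE"], "form_version": ["V_2016"]})
extra = pd.DataFrame({"first_name": ["B"], "form_version": ["V_BJS"]})

# Columns the extra file lacks are fine; columns only it has would be a problem
only_in_main = set(main.columns) - set(extra.columns)
only_in_extra = set(extra.columns) - set(main.columns)
assert not only_in_extra

# Rows from `extra` get NaN in the columns they do not have
merged = pd.concat([main, extra], ignore_index=True)
print(only_in_main)
print(merged)
```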

### Merge the "old master" file with the other CDR records

In [41]:
cdr = pd.concat([cdr, old_master])
cdr.reset_index(inplace=True, drop=True)
cdr.sort_values('form_version', inplace=True)
cdr.head()

Unnamed: 0,age_at_time_of_death,agency_address,agency_city,agency_name,agency_zip,cdr_cdr_name,city,county,date_of_birth,date_time_of_custody_or_incident,death_date_and_time,death_location,death_location_elsewhere,entry_date_time,entry_date_time_n_a,first_name,middle_name,last_name,suffix,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,offense_1,offense_2,offense_3,pre_existing_medical_condition,report_date,sex,specific_type_of_custody_facility,street_address,type_of_custody,type_of_offense,type_of_offense_other,version_number,version_type,were_the_charges,who_caused_the_death,form_version,type_of_restraint,under_restraint,agency_county,custody_date_na,death_causer_other,department_type,entry_behavior,ethnicity,ethnicity_other,other_behavior,specify_other_behavior,exhibit_any_medical_problems,exhibit_any_mental_health_problems,make_suicidal_statements,race
0,40.0,350 W. Belknap,Fort Worth,Fort Worth Police Dept.,76102,PA05001C,Fort Worth,Tarrant,4/3/1964,1/1/05 15:04,2005-01-01 15:04:00,At the crime/arrest scene,,,1,Dino,,Gomez,,Justifiable Homicide,,Firearm,,Gunshot Wound to the Chest,"Yes, results are available",Not Applicable,Aggravated Assault,,,Not Applicable; cause of death was accidental ...,3/2/05 14:48,Male,Custody of Peace Officer during/fleeing arrest,1509 W. Hammond,Police Custody (pre-booking),,,1.0,ORIGINAL VERSION,Not filed at time of death,Law enforcement/correctional staff,V_2005,,No,TARRANT,0.0,,POLICE,,Hispanic,,0,,,,,
3437,33.0,133 N. Riverfront Blvd.,Dallas,Dallas County Sheriff's Dept.,75207,PA14199CJ,Dallas,Dallas,3/23/1981,4/17/14 9:49,2014-04-20 08:14:00,At medical facility,,4/17/14 9:49,0,Alisha,,Skeats,,Natural Causes/Illness,Complications of chronic opiate use,Not applicable; cause of death was intoxicatio...,,"Complications of chronic opiate, benzodiazepin...","Yes, results are available",Yes,Theft of Service,,,Don't know,4/28/14 9:27,Female,Hospital/Infirmary,Parkland Hospital,County Jail,,property crime,1.0,ORIGINAL VERSION,Filed,"Not applicable; cause of death was suicide, in...",V_2005,,No,Dallas,0.0,,SHERIFF,,Anglo,,0,,,,,
3436,64.0,"2503 Lake Road, Suite 5",Huntsville,Texas Department Of Criminal Justice,77340,PA14198P,Tennessee Colony,Anderson,11/2/1949,10/6/98 0:00,2014-04-23 09:07:00,At law enforcement facility,,10/6/98 0:00,0,Orville,Lee,McCoy,,Natural Causes/Illness,Malignant Pleural Effusion GT Bleesing/Anemia,Not applicable; cause of death was intoxicatio...,,ESLD-Liver Disease,"No, evaluation not planned",Yes,Aggravated Sexual Assault of a Child,,,Don't know,4/25/14 15:15,Male,TDCJ,"Michael Unit, 2664 FM 2054",Penitentiary,,,1.0,ORIGINAL VERSION,Convicted,"Not applicable; cause of death was suicide, in...",V_2005,,No,Walker,0.0,,STAGENCY,,Anglo,,0,,,,,
3435,53.0,"2503 Lake Road, Suite 5",Huntsville,Texas Department Of Criminal Justice,77340,PA14197P,New Boston,Bowie,5/11/1960,3/19/12 0:00,2014-04-16 06:44:00,At law enforcement facility,,4/25/12 0:00,0,Pedro,,Juarez,Jr,Natural Causes/Illness,Cardiac arrhythmia,Not applicable; cause of death was intoxicatio...,,Sudden unexpected death in schizophrenia due t...,"Yes, results are available",Not Applicable,Aggravated Assault,,,Don't know,4/25/14 13:25,Male,TDCJ,3899 State Hwy 98,Penitentiary,,,1.0,ORIGINAL VERSION,Convicted,"Not applicable; cause of death was suicide, in...",V_2005,,No,Walker,0.0,,STAGENCY,,Hispanic,,0,,,,,
3434,44.0,"2503 Lake Road, Suite 5",Huntsville,Texas Department Of Criminal Justice,77340,PA14196P,Richmon,Fort Bend,10/3/1969,2/3/14 0:00,2014-04-23 13:35:00,At medical facility,,2/3/14 0:00,0,James,Neal,Benjamin,,Natural Causes/Illness,Thromboembolism,Not applicable; cause of death was intoxicatio...,,Acute pulmonary thromboembolism,"Yes, results are available",Not Applicable,Felony Theft ( 1yr probation) SAFPF,,,Don't know,4/25/14 11:14,Male,TDCJ,Jester 4 Unit-4 Jester Rd.,Penitentiary,,Theft,1.0,ORIGINAL VERSION,Probation/Parole,"Not applicable; cause of death was suicide, in...",V_2005,,No,Walker,0.0,,STAGENCY,,African-American,,0,,,,,


<div style="background-color: #008;height: 15px"></div>

## 2c. Data is merged into one frame. Now do column type conversions

### Convert date columns

In [42]:
s1 = cdr.dtypes
convert_date_cols(cdr)
s2 = cdr.dtypes
different = s1[s1 != s2].index.tolist()
print("Changed %d cols to datetime (from some other dtype):" % len(different), different)

Converting column date_of_birth to datetime
Converting column date_time_of_custody_or_incident to datetime
Converting column death_date_and_time to datetime
Converting column entry_date_time to datetime
Replaced 1361 bad values with NaT:
Unique bad values: {'0000-00-00 00:00:00'}
Converting column report_date to datetime
Changed 4 cols to datetime (from some other dtype): ['date_of_birth', 'date_time_of_custody_or_incident', 'entry_date_time', 'report_date']
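
`convert_date_cols` is defined earlier in the notebook and is not reproduced here. As a rough sketch of the same idea (the helper name `to_datetime_report_bad` is an assumption, not the notebook's code), unparseable values can be coerced to `NaT` and collected for review:

```python
import pandas as pd

def to_datetime_report_bad(series):
    """Convert to datetime; unparseable values become NaT and are returned for review."""
    converted = pd.to_datetime(series, errors="coerce")
    # "Bad" means the value was present but could not be parsed
    bad = set(series[converted.isnull() & series.notnull()])
    return converted, bad

raw = pd.Series(["2005-01-03 17:15:00", "0000-00-00 00:00:00", None])
converted, bad = to_datetime_report_bad(raw)
print("Unique bad values:", bad)
```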


### As shown below, some float columns are in string format. We'll fix that next.

In [43]:
cdr.entry_date_time_n_a.value_counts().head()

0           10098
1            3931
12:00 AM      354
               30
11:00 AM        1
Name: entry_date_time_n_a, dtype: int64

In [44]:
def float_or_nan(val):
    '''Return val as a float, or NaN (with a printed warning) if it cannot be converted.'''
    try:
        return float(val)
    except (ValueError, TypeError):
        print("- BAD VALUE (returning NaN):", val)
        return np.nan

float_cols = [
 'age_at_time_of_death',
 'agency_zip',
 'custody_date_na',
 'entry_date_time_n_a',
 'version_number'
]

for c in float_cols:
    print("Converting", c)
    cdr[c] = cdr[c].apply(float_or_nan).astype(float)

Converting age_at_time_of_death
Converting agency_zip
- BAD VALUE (returning NaN): 78/155
Converting custody_date_na
Converting entry_date_time_n_a
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:22 AM
- BAD VALUE (returning NaN): 3:25 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (returning NaN): 12:00 AM
- BAD VALUE (retu
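
A possible alternative to the row-by-row conversion above is `pd.to_numeric` with `errors="coerce"`, which turns unconvertible values into NaN in one call. The sketch below uses made-up sample values modeled on the `entry_date_time_n_a` output. It lists each bad value once rather than printing a line per row.

```python
import pandas as pd

# Made-up sample resembling the mixed contents of entry_date_time_n_a
raw = pd.Series(["0", "1", "12:00 AM", "", None, 1], dtype=object)

numeric = pd.to_numeric(raw, errors="coerce")

# Values that were present but could not be converted
bad = raw[numeric.isnull() & raw.notnull()]
print("Bad values:", bad.unique())
```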

### Upcase string cell contents

In [45]:
upcase_strip_string_cells(cdr)
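
`upcase_strip_string_cells` is a helper from section 1c and its body is not shown here. A minimal sketch of what such a helper might do, under the assumption that it strips whitespace and upper-cases string cells in place (the name `upcase_strip` is hypothetical):

```python
import pandas as pd

def upcase_strip(frame):
    """Strip whitespace and upper-case every string cell in place; leave other values alone."""
    for col in frame.columns:
        frame[col] = frame[col].apply(
            lambda v: v.strip().upper() if isinstance(v, str) else v)

toy = pd.DataFrame({"name": [" Houston ", None], "n": [1, 2]})
upcase_strip(toy)
print(toy)
```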

### Add a `death_date` column (without the death time)

In [46]:
cdr['death_date'] = cdr.death_date_and_time.dt.normalize()

In [47]:
cdr_ready_to_clean = cdr.copy()

<div style="background-color: #080;height: 20px"></div>

# 3. Begin deep cleaning

In [48]:
# When we're working on modifying this file, and we need to reset
# the dataset, it's annoying to run the whole notebook, fetch from
# data.world, etc. So you can just rerun from here downward.
cdr = cdr_ready_to_clean.copy()

<div style="background-color: #008;height: 15px"></div>

## 3a.  Standardize `race` values

### First we need to merge race and ethnicity columns (the 2005 form calls it 'ethnicity', the 2016 'race')

#### Have a look at the values first

In [49]:
cdr.race.value_counts()

ANGLO OR WHITE                      2024
HISPANIC OR LATINO                  1360
BLACK OR AFRICAN AMERICAN           1331
ASIAN OR PACIFIC ISLANDER             27
OTHER                                 17
AMERICAN INDIAN OR ALASKA NATIVE       4
UNKNOWN                                1
Name: race, dtype: int64

In [50]:
cdr.ethnicity.value_counts()

ANGLO OR WHITE                      3231
BLACK OR AFRICAN AMERICAN           2275
HISPANIC OR LATINO                  2119
ANGLO                               2115
AFRICAN-AMERICAN                    1521
HISPANIC                            1433
ASIAN                                 63
OTHER/DON'T KNOW                      46
AMERICAN INDIAN/ALASKA NATIVE         26
MIDDLE EAST                           24
OTHER                                 22
NATIVE HAWAIIAN/PACIFIC ISLANDER      10
Name: ethnicity, dtype: int64

In [51]:
# Collapse the various OTHER-like values
cdr.loc[cdr.ethnicity.fillna('').str.contains('OTHER'), 'ethnicity'] = 'OTHER'

In [52]:
# When choosing the 'Other' ethnicity in the 2005 form version, there
# is a subsequent field to specify. Though clearly some of them are not
# truly 'other' ethnicities. See:
cdr.ethnicity_other.value_counts()

                       5128
0                       665
WHITE                    14
CAUCASIAN                 8
UNKNOWN                   8
SUDANESE BLACK            2
WHITE NON HISPANIC        2
ARABIAN                   2
ANGLO & MIDDLE EAST       2
CUBAN                     2
BELIZE, BRAZILIAN         1
UNITED KINGDOM            1
Name: ethnicity_other, dtype: int64

In [53]:
# Replace the nonsensical 0 with NaN
cdr.loc[cdr.ethnicity_other.astype(str) == '0', 'ethnicity_other'] = None

In [54]:
# Let's make sure nobody is filling out the "other ethnicity" column when they shouldn't...
cdr[(~(cdr.ethnicity == 'OTHER') & cdr.ethnicity_other.notnull())][['ethnicity', 'ethnicity_other']]

Unnamed: 0,ethnicity,ethnicity_other
0,HISPANIC,
3437,ANGLO,
3436,ANGLO,
3435,HISPANIC,
3434,AFRICAN-AMERICAN,
...,...,...
15736,ANGLO OR WHITE,CAUCASIAN
12866,ANGLO OR WHITE,WHITE
13621,ANGLO OR WHITE,WHITE NON HISPANIC
13554,ANGLO OR WHITE,WHITE


In [55]:
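# Most of these hits are just empty strings, though a few records do specify
# an "other" ethnicity despite already having a standard one (we ignore those).
# Let's transfer the ethnicity_other values of the 'OTHER' records into
# the 'ethnicity' column, so we can merge everything at once.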
other_eth = (cdr.ethnicity == 'OTHER')
print('Merging %d "ethnicity_other" values into the main "ethnicity" column' % other_eth.sum())
cdr.loc[other_eth, 'ethnicity'] = cdr.ethnicity_other[other_eth]
cdr.drop('ethnicity_other', axis=1, inplace=True)

Merging 68 "ethnicity_other" values into the main "ethnicity" column


In [56]:
# Make a single 'race' column that has merged, simplified values of race or ethnicity.
race_eth_list = []
for race, eth in zip(cdr.race, cdr.ethnicity):
    # Only one of (race, eth) should be set
    assert pd.isnull(race) or pd.isnull(eth)
    if pd.isnull(race):
        if pd.isnull(eth):
            race_eth_list.append(None)
            continue
        x = eth
    else:
        x = race
    race_eth_list.append(x)

In [57]:
cdr['race'] = race_eth_list
cdr.drop('ethnicity', axis=1, inplace=True)
cdr.race.value_counts()

ANGLO OR WHITE                      5255
BLACK OR AFRICAN AMERICAN           3606
HISPANIC OR LATINO                  3479
ANGLO                               2115
AFRICAN-AMERICAN                    1521
HISPANIC                            1433
ASIAN                                 63
ASIAN OR PACIFIC ISLANDER             27
AMERICAN INDIAN/ALASKA NATIVE         26
MIDDLE EAST                           24
OTHER                                 17
NATIVE HAWAIIAN/PACIFIC ISLANDER      10
UNKNOWN                                9
WHITE                                  8
CAUCASIAN                              4
AMERICAN INDIAN OR ALASKA NATIVE       4
CUBAN                                  2
ANGLO & MIDDLE EAST                    2
ARABIAN                                2
UNITED KINGDOM                         1
SUDANESE BLACK                         1
WHITE NON HISPANIC                     1
BELIZE, BRAZILIAN                      1
Name: race, dtype: int64
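
The loop above can also be written without explicit iteration. This sketch on toy data (not the real `cdr` frame) keeps the same safety check, that at most one of the two columns is set per row, and then fills `race` from `ethnicity`:

```python
import pandas as pd

toy = pd.DataFrame({
    "race": ["WHITE", None, None],
    "ethnicity": [None, "HISPANIC", None],
}, dtype=object)

# Only one of (race, ethnicity) should be set on any row
both_set = toy["race"].notnull() & toy["ethnicity"].notnull()
assert not both_set.any()

# Take race where present, otherwise fall back to ethnicity
toy["race"] = toy["race"].fillna(toy["ethnicity"])
toy = toy.drop("ethnicity", axis=1)
print(toy)
```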

### Collapse `race` values into a smaller set (white/hispanic/black/other)

In [58]:
race_before = cdr.race.copy()
race_before.name = 'race_before'

In [59]:
standardize_race_cols(cdr)
cdr.race.value_counts()

WHITE       7385
BLACK       5128
HISPANIC    4912
OTHER        186
Name: race, dtype: int64

In [60]:
cdr.groupby([race_before, 'race']).size()

race_before                       race    
AFRICAN-AMERICAN                  BLACK       1521
AMERICAN INDIAN OR ALASKA NATIVE  OTHER          4
AMERICAN INDIAN/ALASKA NATIVE     OTHER         26
ANGLO                             WHITE       2115
ANGLO & MIDDLE EAST               WHITE          2
ANGLO OR WHITE                    WHITE       5255
ARABIAN                           OTHER          2
ASIAN                             OTHER         63
ASIAN OR PACIFIC ISLANDER         OTHER         27
BELIZE, BRAZILIAN                 OTHER          1
BLACK OR AFRICAN AMERICAN         BLACK       3606
CAUCASIAN                         WHITE          4
CUBAN                             OTHER          2
HISPANIC                          HISPANIC    1433
HISPANIC OR LATINO                HISPANIC    3479
MIDDLE EAST                       OTHER         24
NATIVE HAWAIIAN/PACIFIC ISLANDER  OTHER         10
OTHER                             OTHER         17
SUDANESE BLACK                    BLACK
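
`standardize_race_cols` is defined elsewhere and its rules are not shown. The keyword-based function below is only a guess that reproduces the mappings visible in the table above. The name `collapse_race`, the keyword order, and the treatment of unlisted values as `OTHER` are all assumptions.

```python
def collapse_race(value):
    """Map a raw race/ethnicity string to WHITE, BLACK, HISPANIC or OTHER (hypothetical rules)."""
    if value is None:
        return None
    v = value.upper()
    # Order matters: 'SUDANESE BLACK' must hit BLACK, 'WHITE NON HISPANIC' must hit WHITE
    if "BLACK" in v or "AFRICAN" in v:
        return "BLACK"
    if "ANGLO" in v or "WHITE" in v or "CAUCASIAN" in v:
        return "WHITE"
    if "HISPANIC" in v or "LATINO" in v:
        return "HISPANIC"
    return "OTHER"

for raw in ["AFRICAN-AMERICAN", "ANGLO & MIDDLE EAST", "HISPANIC OR LATINO", "CUBAN"]:
    print(raw, "->", collapse_race(raw))
```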

<div style="background-color: #008;height: 15px"></div>

## 3b. Standardize agency information

In [61]:
# Standardize agency name (so we can join/compare across datasets)
cdr['agency_name'] = cdr['agency_name'].apply(standardize_agency_name)

# Lookup county name by agency name. If this fails, fall back
# on the county specified in the form, if it exists.
cdr['agency_county'] = cdr['agency_county'].str.upper()
county_lookup = cdr['agency_name'].apply(lambda name: agency_county.get(name, np.nan))
cdr['agency_county'] = county_lookup.fillna(cdr['agency_county'])

# Manually handle one major agency
cdr.loc[cdr['agency_name'] == 'TEXAS DEPT OF CRIMINAL JUSTICE', 'agency_county'] = 'STATE'
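
The lookup-with-fallback logic in this cell can be illustrated on toy data. The dictionary `agency_to_county` and the agency names below are invented for the example; in the notebook, the lookup comes from the `agencies_and_counties` dataset.

```python
import pandas as pd

# Hypothetical lookup table: agency name -> county
agency_to_county = {"HOUSTON POLICE DEPT": "HARRIS"}

toy = pd.DataFrame({
    "agency_name": ["HOUSTON POLICE DEPT", "SMALLTOWN POLICE DEPT", "UNLISTED AGENCY"],
    "agency_county": ["harris", "bexar", None],
}, dtype=object)

reported = toy["agency_county"].str.upper()            # county as written on the form
looked_up = toy["agency_name"].map(agency_to_county)   # NaN when the agency is not in the table
toy["agency_county"] = looked_up.fillna(reported)      # prefer the lookup, fall back to the form
print(toy)
```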

In [62]:
# Check that we are only missing counties for a paltry few records now.
cdr[cdr['agency_county'].isnull()]['agency_name'].value_counts()

ECTOR CO SHERIFFS OFC                                       7
TEXAS JUVENILE JUSTICE DEPT OFFICE OF INSPECTOR GENERAL     3
TEXARKANA COMMUNITY COLLEGE POLICE DEPT                     2
HILDAGO CO SHERIFFS OFFICE                                  2
DALLASFW AIRPORT DPS                                        2
OFFICE OF THE ATTY GENERAL                                  2
TEST CDR AGENCY                                             2
ROWLETT POLICE DEPT CO ROWLETT                              1
RENO POLICE DEPT PARKER CO                                  1
METROPOLITAN TRANSIT AUTH OF HARRIS CO                      1
NORTHSIDE INDEPENDENT SCHOOL DIST POLICE                    1
GULF COAST VIOLENT OFFENDERS TASK FORCE                     1
SUGARLAND POLICE DEPT                                       1
NACONA POLICE DEPT                                          1
GEO GROUP EAGLE PASS CORRECTIONAL FACILITY                  1
BEAUMONT INDEPENDENT SCHOOL DIST POLICE DEPT                1
ELLIS CO

In [63]:
# Clearly 'TEST CDR AGENCY' is meant to be ignored
test_agencies = cdr['agency_name'] == 'TEST CDR AGENCY'
cdr = cdr[~test_agencies]
print("Dropping %d records from 'TEST CDR AGENCY', leaving %d records" % (test_agencies.sum(), len(cdr)))

Dropping 2 records from 'TEST CDR AGENCY', leaving 17647 records


<div style="background-color: #008;height: 15px"></div>

## 3c.  Fix death-information related columns

#### The two form versions offer slightly different options (e.g. V_2005 uses 'AT MEDICAL FACILITY' while V_2016 uses 'MEDICAL FACILITY'). Collapse values to stabilize the options across form versions, and roll up rare values into a single 'OTHER' value.

### Fix: `death_location`

In [64]:
pd.crosstab(cdr.death_location, cdr.form_version)

death_location,V_2005,V_2016,V_BJS
0,0,382,0
1,0,30,0
AT LAW ENFORCEMENT FACILITY,1269,0,1134
AT MEDICAL FACILITY,3053,0,2682
AT THE CRIME/ARREST SCENE,673,0,565
DEAD ON ARRIVAL AT MEDICAL FACILITY,0,77,0
ELSEWHERE,97,0,94
"ELSEWHERE, SPECIFY",1,64,0
EN ROUTE TO BOOKING CENTER/POLICE LOCKUP,2,3,2
EN ROUTE TO MEDICAL FACILITY,54,0,50


In [65]:
replacements = {
    'AT MEDICAL FACILITY': 'MEDICAL FACILITY',
    'AT LAW ENFORCEMENT FACILITY': 'LAW ENFORCEMENT FACILITY',
    'AT THE CRIME/ARREST SCENE': 'CRIME/ARREST SCENE',
    'SCENE OF INCIDENT': 'CRIME/ARREST SCENE',
    'LAW ENFORCEMENT FACILITY/BOOKING CENTER': 'LAW ENFORCEMENT FACILITY',
    'DEAD ON ARRIVAL AT MEDICAL FACILITY': 'EN ROUTE TO MEDICAL FACILITY',
    'EN ROUTE TO BOOKING CENTER/POLICE LOCKUP': 'EN ROUTE TO LAW ENFORCEMENT FACILITY',
    'ELSEWHERE': OTHER_SPECIFY,
    'ELSEWHERE, SPECIFY': OTHER_SPECIFY,
}
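# Some V_2016 rows hold integer codes (0, 1) instead of strings, and calling
# .strip() on them raises AttributeError, so only strip actual strings.
cdr['death_location'] = cdr['death_location'].apply(
    lambda x: replacements.get(x.strip(), x) if isinstance(x, str) else (None if pd.isnull(x) else x))
pd.crosstab(cdr.death_location, cdr.form_version)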

### Fix: `means_of_death`

In [None]:
pd.crosstab(cdr.means_of_death, cdr.form_version)

In [None]:
replacements = {
    'NOT APPLICABLE, CAUSE OF DEATH WAS ILLNESS/NATURAL CAUSE': 'NOT APPLICABLE',
    'NOT APPLICABLE; CAUSE OF DEATH WAS INTOXICATION OR ILLNESS/NATURAL CAUSES': 'NOT APPLICABLE',
    'OTHER': OTHER_SPECIFY,
    'KNIFE, CUTTING INSTRUMENT': 'KNIFE / EDGED INSTRUMENT',
    'BLUNT INSTRUMENT': 'BATON / BLUNT INSTRUMENT',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'RIFLE/SHOTGUN': 'FIREARM',
}
cdr['means_of_death'] = cdr['means_of_death'].apply(lambda x:  None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr.means_of_death, cdr.form_version)
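
The same replace-if-present pattern recurs for several columns in this section. Some columns can hold non-string values (for example, the integer codes 0 and 1 visible in the `death_location` crosstab), and `.strip()` only exists on strings. One option is a small shared helper that guards on type. The name `apply_replacements` is hypothetical, not part of the notebook.

```python
import pandas as pd

def apply_replacements(series, replacements):
    """Replace known string values; leave unknown strings and non-strings unchanged."""
    def fix(value):
        if isinstance(value, str):
            return replacements.get(value.strip(), value)
        if pd.isnull(value):
            return None
        return value
    return series.apply(fix)

toy = pd.Series(["AT MEDICAL FACILITY ", "ELSEWHERE", 0, None], dtype=object)
cleaned = apply_replacements(toy, {"AT MEDICAL FACILITY": "MEDICAL FACILITY"})
print(cleaned.tolist())
```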

### It seems officers sometimes misuse the 'other' option, selecting it when another category is more appropriate:

In [None]:
cdr[cdr.means_of_death == OTHER_SPECIFY]['means_of_death_other'].value_counts().head()

### Let's roll all rare and 'unknown'-type values into the 'other' category. 'VEHICLE ACCIDENT' in particular is new in the 2016 form, only indicated as 'other' in older forms.

In [None]:
other_values = ['UNKNOWN', 'VEHICLE ACCIDENT', 'KNIFE / EDGED INSTRUMENT', 'BATON / BLUNT INSTRUMENT']
indices = cdr['means_of_death'].isin(other_values)
cdr.loc[indices, 'means_of_death_other'] = cdr.loc[indices, 'means_of_death']
cdr.loc[indices, 'means_of_death'] = OTHER_SPECIFY

pd.crosstab(cdr.means_of_death, cdr.form_version)
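
The order of the two assignments in this cell matters: the specific value has to be copied into the free-text column before the main column is overwritten. A toy illustration, where `OTHER` is a stand-in for the notebook's `OTHER_SPECIFY` constant (its actual value is not shown in this section):

```python
import pandas as pd

OTHER = "OTHER"  # stand-in for OTHER_SPECIFY; the real value is an assumption

toy = pd.DataFrame({
    "means_of_death": ["FIREARM", "VEHICLE ACCIDENT", "UNKNOWN"],
    "means_of_death_other": [None, None, None],
}, dtype=object)

rare = toy["means_of_death"].isin(["UNKNOWN", "VEHICLE ACCIDENT"])
# 1) Preserve the specific value in the free-text column...
toy.loc[rare, "means_of_death_other"] = toy.loc[rare, "means_of_death"]
# 2) ...then collapse the main column to the catch-all category
toy.loc[rare, "means_of_death"] = OTHER
print(toy)
```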

### Fix: `manner_of_death` and  `manner_of_death_description`

In [None]:
pd.crosstab(cdr.form_version, cdr.manner_of_death).T

In [None]:
pd.crosstab(cdr.manner_of_death, cdr.manner_of_death_description.notnull())

In [None]:
replacements = {
    'NATURAL': 'NATURAL CAUSES/ILLNESS',
    'JUSTIFIABLE HOMICIDE': 'HOMICIDE',
    'HOMICIDE BY LAW ENFORCEMENT/CORRECTIONAL STAFF': 'HOMICIDE',
    'OTHER HOMICIDE': 'HOMICIDE',
    'HOMICIDE (INCLUDES JUSTIFIABLE HOMICIDE)': 'HOMICIDE',
    'ACCIDENTAL INJURY CAUSED BY OTHERS': 'ACCIDENTAL',
    'ACCIDENTAL INJURY TO SELF': 'ACCIDENTAL',
    'OTHER': OTHER_SPECIFY,
    'OTHER - SPECIFY': OTHER_SPECIFY,
}
cdr['manner_of_death'] = cdr['manner_of_death'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))

# In past versions, "pending autopsy results" was not an option, and reports had "other"
# checked with some mention of pending autopsy in the free field. We emulate this here
# to preserve consistency across form versions.
other_values = ['PENDING AUTOPSY RESULTS', 'COULD NOT BE DETERMINED']
indices = cdr['manner_of_death'].isin(other_values)
cdr.loc[indices, 'manner_of_death_description'] = cdr.loc[indices, 'manner_of_death']
cdr.loc[indices, 'manner_of_death'] = OTHER_SPECIFY

pd.crosstab(cdr.form_version, cdr.manner_of_death).T

### A few suicides by hanging have the wrong `means_of_death`

In [None]:
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death != 'HANGING, STRANGULATION')]
frame = frame[(frame.medical_cause_of_death.fillna('').str.contains('HANGING')) |
              frame.manner_of_death_description.fillna('').str.contains('HANGING')]
print(len(frame))
frame[['medical_cause_of_death', 'means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description']].head()

In [None]:
cdr.loc[frame.index, 'means_of_death'] = 'HANGING, STRANGULATION'

# Be sure we got them all
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death != 'HANGING, STRANGULATION')]
frame = frame[frame.medical_cause_of_death.fillna('').str.contains('HANGING')]
assert len(frame) == 0

### For a few suicides, the `means_of_death` is 'NOT APPLICABLE', which makes no sense. We need to change these to OTHER, though (as shown below) more is going on

In [None]:
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death == 'NOT APPLICABLE')]
print(len(frame))
frame[['means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description', 'medical_cause_of_death']].head()

In [None]:
cdr.loc[frame.index, 'means_of_death'] = OTHER_SPECIFY
assert len(cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death == 'NOT APPLICABLE')]) == 0

### Fix: `pre_existing_medical_condition` 

In [None]:
pd.crosstab(cdr.form_version, cdr.pre_existing_medical_condition).T

In [None]:
replacements = {
    'DECEASED DEVELOPED CONDITION AFTER ADMISSION': 'DEVELOPED CONDITION AFTER ADMISSION',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'NOT APPLICABLE; CAUSE OF DEATH WAS ACCIDENTAL INJURY, INTOXICATION, SUICIDE OR HOMICIDE': 'NOT APPLICABLE',
    'COULD NOT BE DETERMINED': 'UNKNOWN',
    'PRE-EXISTING MEDICAL CONDITION': 'YES',
}
cdr['pre_existing_medical_condition'] = cdr['pre_existing_medical_condition'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr.form_version, cdr.pre_existing_medical_condition).T

### Fix: `who_caused_the_death` 

NOTE: This question is framed as follows:
* 2005 form: "If the death was an accident or homicide, who caused the death?"
* 2016 form: "If the death was an accident, homicide **or suicide**, who caused the death?" (emphasis added)

Thus, we need to:
1. Collapse near-identical values from different forms, similar to the other areas here.
1. Remove suicides from the 2016 responses, as they skew the data (see below). While we're at it, change ANY entries whose manner of death is not homicide, accidental, or other to have "NOT APPLICABLE" as the value.

#### Collapse near-identical values from different forms, similar to the other areas here.

In [None]:
pd.crosstab(cdr.who_caused_the_death, cdr.death_date_and_time.dt.year)

In [None]:
replacements = {
    'DECEASED': 'DECEDENT',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'LAW ENFORCEMENT/CORRECTIONAL STAFF': 'LAW ENFORCEMENT/CORRECTIONAL PERSONNEL',
    'NOT APPLICABLE; CAUSE OF DEATH WAS SUICIDE, INTOXICATION OR ILLNESS/NATURAL CAUSES': 'NOT APPLICABLE',
    'OTHER DETAINEES': 'OTHER DETAINEE(S)',
    'OTHER PERSONS': 'OTHER CIVILIAN(S)',
    'ACCIDENTAL INJURY TO SELF': 'ACCIDENTAL',
    'UNKNOWN PERSON(S) CAUSED THE INJURY': 'UNKNOWN',
    'UNKNOWN WHETHER DECEDENT SUSTAINED A FATAL INJURY': 'UNKNOWN',
}
cdr['who_caused_the_death'] = cdr['who_caused_the_death'].apply(lambda x:  None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['who_caused_the_death'], cdr.form_version)

#### Remove suicides from the 2016 responses to 'who caused the death' question, as in previous years 'who caused the death' was only asked for homicides and accidents. (see below)

In [None]:
pd.crosstab(cdr.who_caused_the_death, cdr.manner_of_death)

In [None]:
cdr.loc[~cdr.manner_of_death.isin(['HOMICIDE', 'ACCIDENTAL', OTHER_SPECIFY]), 'who_caused_the_death'] = 'NOT APPLICABLE'
pd.crosstab(cdr.who_caused_the_death, cdr.manner_of_death)
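
The masking step can be seen on a toy frame. This sketch uses `Series.where`, which keeps values where the condition holds and substitutes the fallback elsewhere. The rows are invented, and for brevity the condition omits the `OTHER_SPECIFY` category that the cell above also keeps.

```python
import pandas as pd

toy = pd.DataFrame({
    "manner_of_death": ["HOMICIDE", "SUICIDE", "NATURAL CAUSES/ILLNESS", "ACCIDENTAL"],
    "who_caused_the_death": ["LAW ENFORCEMENT/CORRECTIONAL PERSONNEL", "DECEDENT",
                             "NOT APPLICABLE", "DECEDENT"],
})

# The question only applies to homicides and accidents on the older form
asked = toy["manner_of_death"].isin(["HOMICIDE", "ACCIDENTAL"])
toy["who_caused_the_death"] = toy["who_caused_the_death"].where(asked, "NOT APPLICABLE")
print(toy)
```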

### Fix `medical_examinor_coroner_evalution`

In [None]:
pd.crosstab(cdr.form_version, cdr.medical_examinor_coroner_evalution).T

In [None]:
def fix_coroner(result):
    if pd.isnull(result):
        return None
    result = result.strip()
    if result.startswith('YES'):
        return 'YES'
    elif result.startswith('NO'):
        return 'NO'
    return None

cdr['medical_examinor_coroner_evalution'] = cdr['medical_examinor_coroner_evalution'].apply(fix_coroner)
pd.crosstab(cdr.form_version, cdr.medical_examinor_coroner_evalution).T

<div style="background-color: #008;height: 15px"></div>

## 3d.  Fix other columns one-by-one

### Fix `gender` related columns

In [None]:
standardize_gender_cols(cdr)

### Fix: `were_the_charges` 

In [None]:
pd.crosstab(cdr['were_the_charges'], cdr.form_version)

In [None]:
replacements = {
    'CAPITAL MURDER': 'CONVICTED',
    'PROBATION/PAROLE': 'PROBATION/PAROLE VIOLATION',
    'A PROBATION/PAROLE VIOLATION': 'PROBATION/PAROLE VIOLATION',
}
cdr['were_the_charges'] = cdr['were_the_charges'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['were_the_charges'], cdr.form_version)

### Fix `type_of_custody`

In [None]:
pd.crosstab(cdr['type_of_custody'], cdr.form_version)

In [None]:
replacements = {
    'PRE-CUSTODIAL USE OF FORCE': 'POLICE CUSTODY (PRE-BOOKING)',
    'PRIVATE CORRECTIONAL FACILITY': 'PRIVATE FACILITY',
    'COUNTY JAIL': 'JAIL - COUNTY',
    'MUNICIPAL JAIL': 'JAIL - MUNICIPAL',
    'PENITENTIARY': 'PRISON',
}
cdr['type_of_custody'] = cdr['type_of_custody'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['type_of_custody'], cdr.form_version)

### Fix `specific_type_of_custody_facility`

In [None]:
pd.crosstab(cdr.specific_type_of_custody_facility, cdr.form_version)

In [None]:
replacements = {
    'TDCJ, SPECIFY': 'TDCJ',
    'CUSTODY OF PEACE OFFICER DURING/FLEEING ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST',
    'CUSTODY OF PEACE OFFICER SUBSEQUENT TO ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST',
    'CUSTODY OF LAW ENFORCEMENT PERSONNEL SUBSEQUENT TO ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST',
    'TEXAS-JUVENILE JUSTICE DEPARTMENT - FACILITY/DETENTION CENTER, SPECIFY': 'OTHER',
    'TJPC': 'OTHER',
    'TYC': 'OTHER',
    'HALFWAY HOUSE/RESTITUTION CENTER': 'OTHER',
    'CORRECTIONAL/REHABILITATION FACILITY': 'OTHER',
    'NON-LAW ENFORCEMENT DETOX FACILITY': 'OTHER',
}
cdr['specific_type_of_custody_facility'] = cdr['specific_type_of_custody_facility'].apply(
    lambda x: x if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['specific_type_of_custody_facility'], cdr.form_version)

### Glance at the types of custody crossed with facility

In [None]:
pd.crosstab(cdr.specific_type_of_custody_facility, cdr.type_of_custody)

### `other_behavior` and `specify_other_behavior` are very annoying and misused

In [None]:
s = cdr['other_behavior'].value_counts()
print("%d records have 'other_behavior' set (%d do not)" % (s.sum(), len(cdr) - s.sum()))
print("%d unique values for 'other_behavior' - sample below:" % len(s))
s.head(10)

In [None]:
s = cdr['specify_other_behavior'].value_counts()
print("%d records have 'specify_other_behavior' set (%d do not)" % (s.sum(), len(cdr) - s.sum()))
print("%d unique values for 'specify_other_behavior' - sample below:" % len(s))
s.head(10)

### It _appears_ that anytime other_behavior is 1.0, specify_other_behavior is set:

In [None]:
cdr[cdr['specify_other_behavior'].notnull()]['other_behavior'].fillna('(None)').value_counts()

### Alright, we can just overwrite 'other_behavior' with 'specify_other_behavior' when the latter is given

In [None]:
behavior = []
for other, specify in zip(cdr['other_behavior'], cdr['specify_other_behavior']):
    if pd.notnull(specify):
        behavior.append(specify)
    else:
        try:
            other = float(other)
            if other == 0.0:
                behavior.append(None)
                continue
        except TypeError:
            pass
        except ValueError:
            pass
        behavior.append(other)

cdr['other_behavior'] = pd.Series(behavior, index=cdr.index)
cdr.drop('specify_other_behavior', axis=1, inplace=True)

cdr['other_behavior'].fillna('(None)').value_counts().head(10)
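
A vectorized sketch of roughly the same rule, on invented values: prefer `specify_other_behavior`, otherwise keep `other_behavior` unless it is a numeric zero. Unlike the loop above, it does not convert numeric strings to floats, so treat it as an approximation rather than a drop-in replacement.

```python
import pandas as pd

specify = pd.Series(["INTOXICATED", None, None], dtype=object)
other = pd.Series([1.0, 0.0, "COMBATIVE"], dtype=object)

# Blank out numeric zeros; non-numeric text coerces to NaN and is therefore kept
as_number = pd.to_numeric(other, errors="coerce")
other_clean = other.where(as_number != 0.0)

# Prefer the free-text field when it is given
behavior = specify.fillna(other_clean)
print(behavior.tolist())
```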

<div style="background-color: #008;height: 15px"></div>

## 3e. Identify and drop a range of unnecessary columns

### `entry_date_time_n_a` and `custody_date_na` are useless - we can infer them from (duh) an NA value in entry_date_time or custody_date

In [None]:
cdr.groupby([cdr.entry_date_time.isnull(), cdr.entry_date_time_n_a]).size().unstack()

In [None]:
pd.crosstab(cdr.custody_date_na, cdr['date_time_of_custody_or_incident'].isnull())

In [None]:
cdr.drop(['entry_date_time_n_a', 'custody_date_na'], axis=1, inplace=True)

### 'department_type' is not consistently present. Drop it.

In [None]:
pd.crosstab(cdr.department_type.notnull(), cdr.form_version)

In [None]:
cdr.drop('department_type', axis=1, inplace=True)

<div style="background-color: #080;height: 20px"></div>

# 4. De-duplicate (oh boy, this is a doozy)

### In theory we should be able to de-dup by the record id (`cdr_cdr_name`), BUT...
1. Sometimes two records with the same record ID differ on some columns. _Usually_ the version_number can tell us which one is the latest, but not always.
1. Sometimes we'll see multiple records for the same person with different record IDs (ugh).

...so it's complicated.

### After a LOT of tinkering, I think we can catch most reasonable duplicates with the following:
  * Drop all records that are a complete, exact duplicate of another (easy step)
  * Merge all records sharing a record id, `cdr_cdr_name`
  * Merge all records with the same name and date of birth
  * Merge all records with the same name and date of death
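
To make the strategy above concrete, here is a minimal sketch on a made-up frame. The toy rows are invented for illustration (not real CDR records). It only counts which rows each round would flag as duplicates. It does not merge anything, and unlike the real pipeline it runs every round against the same frame rather than against the output of the previous round.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'cdr_cdr_name': ['A1', 'A1', 'B2', 'C3', 'D4'],
    'first_name': ['ANN', 'ANN', 'BOB', 'BOB', 'CAL'],
    'last_name': ['LEE', 'LEE', 'RAY', 'RAY', 'DOE'],
    'date_of_birth': ['1970-01-01', '1970-01-01', '1980-05-05', '1980-05-05', '1990-09-09'],
    'death_date': ['2010-01-01', '2010-01-01', '2012-02-02', '2012-02-03', '2015-03-03'],
})

rounds = [
    ['cdr_cdr_name'],
    ['first_name', 'last_name', 'date_of_birth'],
    ['first_name', 'last_name', 'death_date'],
]

# keep=False marks every member of a duplicate group, not just the repeats.
flagged = {tuple(cols): int(toy.duplicated(subset=cols, keep=False).sum())
           for cols in rounds}
for cols, n in flagged.items():
    print(cols, '->', n, 'rows flagged')
```

Note how the two "BOB RAY" rows have different record IDs and death dates one day apart, so only the name + date of birth round catches them. That is the reason for running more than one round.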

### 4a. Get rid of totally, utterly duplicate rows

In [None]:
pure_dups = cdr.duplicated()
cdr = cdr[~pure_dups]
print("Dropping %d rows that are 100%% duplicates of another row, leaving %d rows" % (pure_dups.sum(), len(cdr)))

### 4b. Define some dedup functions to help us

In [None]:
def merge_dup_records(frame):
    '''Master merge function. Creates one record from several that are known duplicates.'''
    # Ignore BJS records (these are from an old data dump),
    # unless there is no other option.
    form_versions_seen = set(frame['form_version'])
    if 'V_BJS' in form_versions_seen and len(form_versions_seen) > 1:
        frame = frame[frame['form_version'] != 'V_BJS']
        if len(frame) == 1:
            return frame.iloc[0], 'Keeping the only non-BJS record'
    
    
    # If one record has a higher version_number than the rest, keep that one.
    # If one record has a more recent report_date than the rest, keep that one.
    max_cols = ['version_number', 'report_date']
    for c in max_cols:
        maxval = frame[c].max()  # Implicitly ignores missing values, unless only missing values exist
        if pd.notnull(maxval):
            frame = frame[frame[c] == maxval]
            if len(frame) == 1:
                return frame.iloc[0], 'Keeping the record with greatest %s' % c

    # Otherwise, there's no way to flag the "one" right record (that I know of).
    # So we gotta merge them somehow...
    # dtype=object so the record can hold mixed types without dtype warnings
    merged_rec = pd.Series(index=frame.columns,
                           dtype=object,
                           name=1000000 + frame.index[0])  # Give it a new, unique index
    awk = False
    for c in frame.columns:
        notnull = frame[c][frame[c].notnull()]

        # If all records have NA for this column, leave it as NA
        if len(notnull) == 0:
            merged_rec[c] = frame[c].iloc[0]
            continue

        # Only 1 unique not-null value? Keep that one.
        if len(notnull) == 1 or len(set(notnull)) == 1:
            merged_rec[c] = notnull.iloc[0]
            continue
        
        # Are we trying to merge record IDs? That's impossible anyway,
        # let's just concatenate them.
        if c == 'cdr_cdr_name':
            merged_rec[c] = '-'.join(notnull)
            continue
        
        # Well, poop. Multiple unique values for this column.
        # Take the most popular one ¯\_(ツ)_/¯
        # (Which will just be a random one if there's a tie ¯\_(ツ)_/¯ )
        awk = True
        vc = notnull.value_counts()
        keeper = vc.index[0]
        if vc.iloc[0] > vc.iloc[1]:
            print("  > Problem with column %s, keeping the most popular value, '%s'"
                  % (c, keeper), notnull.values)
        else:
            print("  > Problem with column %s, keeping an arbitrary tied-for-most-popular value, '%s'"
                  % (c, keeper), notnull.values)
        merged_rec[c] = keeper

    merged_rec['cdr_cdr_name'] = 'MERGED-DUPLICATES-%s' % merged_rec['cdr_cdr_name']
    if awk:
        return merged_rec, 'Merged awkwardly'
    else:
        return merged_rec, 'Merged smoothly enough'
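
The "most popular value" fallback leans on the fact that `value_counts()` sorts by count, descending. A minimal sketch of just that step, using an invented column of values (the labels here are assumptions, not real CDR categories):

```python
import pandas as pd

# Invented values for one column across four "duplicate" records.
values = pd.Series(['JAIL', 'JAIL', 'HOSPITAL', None])

notnull = values[values.notnull()]
vc = notnull.value_counts()          # sorted by count, most frequent first
keeper = vc.index[0]                 # the most frequent non-null value
is_clear_winner = vc.iloc[0] > vc.iloc[1]  # False would mean a tie at the top
print(keeper, is_clear_winner)
```

When there is a tie, which of the tied values lands first is not something this code controls, so the choice really is arbitrary in that case.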

In [None]:
def dedup_cdr_by(cdr, cols):
    '''Given a cdr dataframe, and a set of columns to use to identify duplicates, dedups/merges as needed.'''
    dups = cdr[cdr.duplicated(subset=cols, keep=False)]
    if not len(dups):
        # Nothing to merge: return the same 4-tuple shape that the caller unpacks
        return cdr, [], [], []
    unmerged_frames = []
    merged_records = []
    merge_methods = []
    for _, frame in dups.groupby(cols):
        rec, meth = merge_dup_records(frame)
        unmerged_frames.append(frame)
        merged_records.append(rec)
        merge_methods.append(meth)
        if 'awkward' in meth:
            print("...awkward merge complete for records at indices", frame.index)

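    # DataFrame.append was removed in pandas 2.0, so build the merged rows into a frame and concat
    new_cdr = pd.concat([cdr.drop(dups.index), pd.DataFrame(merged_records)])
    return new_cdr, unmerged_frames, merged_records, merge_methods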
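
One thing worth checking in a function shaped like this: `duplicated()` treats missing values as equal to each other, but `groupby()` silently skips groups whose key contains a missing value unless `dropna=False` is passed (available in pandas 1.1+). Rows flagged as duplicates only because of matching NAs would then be dropped without ever being merged. Whether that actually happens here depends on whether the key columns can be null at this point in the notebook, which this section does not show. A minimal sketch on invented data:

```python
import pandas as pd

# Invented rows: the last two share a missing first name.
toy = pd.DataFrame({
    'first_name': ['ANN', 'ANN', None, None],
    'last_name': ['LEE', 'LEE', 'DOE', 'DOE'],
    'death_date': ['2010-01-01', '2010-01-01', '2011-01-01', '2011-01-01'],
})
cols = ['first_name', 'last_name', 'death_date']

dups = toy[toy.duplicated(subset=cols, keep=False)]
n_flagged = len(dups)
# Default groupby drops any group whose key contains NA.
n_grouped_default = sum(len(g) for _, g in dups.groupby(cols))
# dropna=False keeps those groups.
n_grouped_keepna = sum(len(g) for _, g in dups.groupby(cols, dropna=False))
print(n_flagged, n_grouped_default, n_grouped_keepna)
```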

### 4c. Dedup!

In [None]:
dedup_rounds = [
    ['cdr_cdr_name'],
    ['first_name', 'last_name', 'date_of_birth'],
    ['first_name', 'last_name', 'death_date'],
]
all_cdrs = [cdr]
all_merged_records = []
all_unmerged_frames = []
all_merge_methods = []
for i, dr in enumerate(dedup_rounds):
    print('**** Dedup step %d: find duplicates on these columns:' % (i + 1), dr)
    vals = dedup_cdr_by(all_cdrs[-1], dr)
    new_cdr, umf, mr, mm = vals
    all_cdrs.append(new_cdr)
    all_unmerged_frames.append(umf)
    all_merged_records.append(mr)
    all_merge_methods.append(mm)
    dropping = sum(len(f) for f in umf)
    print("Dropping %d duplicates and adding %d merged records, yielding %d records" % (
        dropping, len(mr), len(new_cdr)))

In [None]:
print("Ultimately removing %d duplicate records, leaving %d" % (
    len(all_cdrs[0]) - len(all_cdrs[-1]), len(all_cdrs[-1])))
cdr = all_cdrs[-1]

### 4d. If you want to inspect the merges (what records were merged into what), use the code below

In [None]:
for rd, cols in enumerate(dedup_rounds):
    print("In round %d, there were %d record merges based on" % (rd, len(all_merged_records[rd])), cols)

In [None]:
def show_merge(merge_round, idx):
    umf = all_unmerged_frames[merge_round][idx]
    mr = all_merged_records[merge_round][idx]
    renamed = mr.copy()
    renamed.name = 'merged'
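    # DataFrame.append was removed in pandas 2.0; concat the merged record as a one-row frame
    return pd.concat([umf, renamed.to_frame().T])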

In [None]:
show_merge(2, 1)

<div style="background-color: #080;height: 20px"></div>

# 5. Add new columns, rename/reorder, otherwise finalize dataframe

<div style="background-color: #008;height: 15px"></div>

## 5a. New columns and bookkeeping columns

### Column for the time (in days) between incarceration (or incident) and death
Note: for a few of these, the death date/time is before the incarceration/incident date. If the gap is at most one day, we just call it 0 and assume it was a minor error. If the gap is more than a day, clearly there was a larger error, so we use a NaN value.

In [None]:
cdr.date_time_of_custody_or_incident.dtype

In [None]:
def get_days(dt):
    if dt.days < -1:
        return None
    elif dt.days == -1:
        return 0
    else:
        return dt.days

delta = cdr.death_date_and_time - cdr.date_time_of_custody_or_incident
print("For %d records with death date before custody date, setting the days_from_custody_to_death to NaN" % (delta.dt.days < -1).sum())
cdr['days_from_custody_to_death'] = delta.apply(get_days)
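
Why `-1` is the cutoff: `Timedelta.days` rounds toward negative infinity, so any negative gap of up to 24 hours reports `days == -1`, and anything larger reports `-2` or less. A small sketch with made-up timestamps:

```python
import pandas as pd

# Made-up timestamps for illustration.
custody = pd.Timestamp('2010-01-02 01:00')
small_gap = pd.Timestamp('2010-01-01 22:00') - custody  # death 3 hours "before" custody
large_gap = pd.Timestamp('2009-12-31 22:00') - custody  # death 27 hours "before" custody

# small_gap is stored as "-1 days +21:00:00", large_gap as "-2 days +21:00:00"
print(small_gap.days, large_gap.days)
```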

In [None]:
cdr['suffix']

In [None]:
cdr['name_full'] = ''
for col in ['first_name', 'middle_name', 'last_name']:
    cdr['name_full'] = cdr['name_full'] + ' ' + cdr[col].fillna('')
cdr['name_full'] = cdr['name_full'].apply(lambda s: ' '.join(s.strip().split()))
cdr.loc[cdr['name_full'] == '', 'name_full'] = np.nan

### Tweak bookkeeping columns

In [None]:
cdr['num_revisions'] = cdr['version_number'] - 1
cdr.drop(['version_type', 'version_number'], axis=1, inplace=True)

<div style="background-color: #008;height: 15px"></div>

## 5b. Rename and reorder columns sensibly

In [None]:
col_renames = {
    'first_name': 'name_first',
    'middle_name': 'name_middle',
    'last_name': 'name_last',
    'suffix': 'name_suffix',
    'cdr_cdr_name': 'record_id',
    'death_causer_other': 'who_caused_death_in_homicide_or_accident_other',
    'who_caused_the_death': 'who_caused_death_in_homicide_or_accident',
    'death_location': 'death_location_type',
    'death_location_elsewhere': 'death_location_type_other',
    'city': 'death_location_city',
    'county': 'death_location_county',
    'street_address': 'death_location_street_address',
    'entry_date_time': 'facility_entry_date_time',
    'pre_existing_medical_condition': 'death_from_pre_existing_medical_condition',
}

In [None]:
cdr.rename(columns=col_renames, inplace=True)

In [None]:
new_order = [
    # Record indexing columns
    'record_id',
    'num_revisions',
    'form_version',
    'report_date',
    'date_time_of_custody_or_incident',

    # Deceased personal information, demographics
    'name_first',
    'name_last',
    'name_middle',
    'name_suffix',
    'name_full',
    'date_of_birth',
    'age_at_time_of_death',
    'sex',
    'race',

    # Death event information
    'death_date',
    'death_date_and_time',
    'death_location_county',
    'death_location_city',
    'death_location_street_address',
    'death_location_type',
    'death_location_type_other',
    'death_from_pre_existing_medical_condition',
    'manner_of_death',
    'manner_of_death_description',
    'means_of_death',
    'means_of_death_other',
    'medical_cause_of_death',
    'medical_examinor_coroner_evalution',
    'medical_treatment',
    'days_from_custody_to_death',
    'who_caused_death_in_homicide_or_accident',
    'who_caused_death_in_homicide_or_accident_other',

    # Criminal information on deceased
    'offense_1',
    'offense_2',
    'offense_3',
    'type_of_offense',
    'type_of_offense_other',
    'were_the_charges',

    # Facility and agency information
    'facility_entry_date_time',
    'type_of_custody',
    'specific_type_of_custody_facility',
    'agency_address',
    'agency_city',
    'agency_county',
    'agency_name',
    'agency_zip',
    
    # Deceased behavior upon entry or custody
    'type_of_restraint',
    'under_restraint',
    'entry_behavior',
    'other_behavior',
    'exhibit_any_medical_problems',
    'exhibit_any_mental_health_problems',
    'make_suicidal_statements',
]

cdr = reorder_columns_and_check(cdr, new_order)

<div style="background-color: #008;height: 15px"></div>

## 5c. Final look before writing (NO CHANGES from here until writing)

In [None]:
cdr.sample(10)

In [None]:
show_notnull_chart(cdr)

In [None]:
print(len(cdr))

In [None]:
cdr.form_version.value_counts()

In [None]:
pd.crosstab(cdr.death_date.apply(lambda dt: dt.year), cdr.form_version)

### One more duplication sanity check -- any CDRs with the same name since 2018?

In [None]:
f = cdr[cdr.death_date.apply(lambda dt: dt.year) >= 2018]
vc = f['name_full'].value_counts()
vc[vc > 1]

### Looking at these cases, they seem fine. Just name coincidences.

In [None]:
f[f['name_full'] == 'RENE GARCIA']

In [None]:
f[f['name_full'] == 'JOSE RIOS']

<div style="background-color: #080;height: 20px"></div>

# 6. Write

In [None]:
if os.environ.get('CLEAN_CDR_DW') != 'TRUE':
    print("Not syncing to Data.world. To do so, set CLEAN_CDR_DW to 'TRUE'")
else:
    with dw.open_remote_file(OUTPUT_DW_PROJECT, OUTPUT_FILENAME) as w:
        print("Writing to data.world:", OUTPUT_FILENAME)
        cdr.to_csv(w, index=False)
        

In [None]:
if os.environ.get('CLEAN_CDR_S3') != 'TRUE':
    print("Not writing to s3. To do so, set CLEAN_CDR_S3 to 'TRUE'")
else:
    csv_buffer = StringIO()
    cdr.to_csv(csv_buffer, index=False)
    s3_resource = boto3.resource('s3')
    s3_resource.Object('tji-public-cleaned-datasets', OUTPUT_FILENAME).put(Body=csv_buffer.getvalue())


<div style="background-color: #A0A;height: 20px"></div>

# 7. TODOs -- problems that still exist in the data and need work

### [TODO] `age_at_time_of_death` and `date_of_birth` can sometimes be impossible

In [None]:
cdr.age_at_time_of_death[cdr.age_at_time_of_death < 0].value_counts().sort_index()

In [None]:
print("%d records have a birth date AFTER death date" % (cdr.date_of_birth > cdr.death_date).sum())
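
A possible starting point for this TODO, sketched on invented rows: recompute the age from the two dates and flag records where the dates are impossible or the reported age is negative. The column names match the notebook, but the `computed_age` and `implausible` names are hypothetical, and what to do with flagged rows (null them out, trust one field over the other) is still an open question.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'date_of_birth': pd.to_datetime(['1970-06-15', '2012-01-01', None]),
    'death_date': pd.to_datetime(['2010-06-14', '2011-01-01', '2015-05-05']),
    'age_at_time_of_death': [39, -1, 40],
})
dob, dod = toy['date_of_birth'], toy['death_date']

# Age in whole years: subtract one if the birthday had not yet happened that year.
before_birthday = (dod.dt.month < dob.dt.month) | (
    (dod.dt.month == dob.dt.month) & (dod.dt.day < dob.dt.day))
computed_age = dod.dt.year - dob.dt.year - before_birthday.astype(int)

# Flag impossible records (comparisons against a missing date evaluate to False).
implausible = (dob > dod) | (toy['age_at_time_of_death'] < 0)
print(pd.DataFrame({'computed_age': computed_age, 'implausible': implausible}))
```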

### [TODO] `type_of_offense` is a semicolon-separated list. Break this into multiple binary columns? Or reduce to only the most serious type? Unclear what to do if we want to actually USE this column

In [None]:
s = cdr.type_of_offense
print("%d unique values for type_of_offense. Sample below:" % len(set(cdr.type_of_offense)))
s.value_counts().head(25)

In [None]:
s = cdr.type_of_offense_other
print("%d unique values for type_of_offense_other. Sample below:" % len(set(cdr.type_of_offense_other)))
s.value_counts().head(10)
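
If we go the "multiple binary columns" route, pandas has `Series.str.get_dummies` for exactly this shape of data. A minimal sketch, assuming the values are separated by `;` with optional surrounding whitespace (the offense labels below are invented, and the real column may need more cleanup first):

```python
import pandas as pd

# Invented values for illustration only.
s = pd.Series(['VIOLENT; DRUG', 'DRUG', None, 'PROPERTY;VIOLENT'])

# Normalize spacing around the separator so ' DRUG' and 'DRUG' become one column.
normalized = s.str.replace(r'\s*;\s*', ';', regex=True).str.strip()

# One 0/1 column per distinct value; rows with a missing value get all zeros.
dummies = normalized.str.get_dummies(sep=';')
print(dummies)
```

With hundreds of unique raw values this could produce a lot of columns, so some grouping of categories would probably still be needed before it is useful.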

### [TODO] Manner of death is sometimes 'OTHER' when it should not be
Two columns detail the manner of death: `manner_of_death` and `manner_of_death_description`. A handful of rows have `manner_of_death` = `OTHER` when they should have picked `ACCIDENT` or `HOMICIDE` -- as evinced by the information in the `manner_of_death_description` column:

In [None]:
frame = cdr[(cdr.manner_of_death == OTHER_SPECIFY) & (cdr.manner_of_death_description.notnull())]
for exempt in ('PENDING', 'DETERMIN'):
    frame = frame[~frame.manner_of_death_description.str.contains(exempt)]
print(len(frame), len(cdr))
frame[['form_version', 'manner_of_death_description', 'manner_of_death', 'medical_cause_of_death']]

### [TODO] (Related) Extract motor vehicle accidents into a unique type?

In [None]:
# This starter code should get it going
frame = cdr[(cdr.manner_of_death != 'ACCIDENTAL')]
frame = frame[(frame.medical_cause_of_death.fillna('').str.contains('ACCIDENT')) |
              frame.manner_of_death_description.fillna('').str.contains('ACCIDENT')]
print(len(frame))
frame[['medical_cause_of_death', 'means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description']]
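
One way this could go, sketched on invented rows: search both free-text columns for vehicle-related keywords and store the result as a boolean flag. The keyword list and the `is_motor_vehicle` name are guesses for illustration, not something derived from the real data, and would need tuning against the actual descriptions.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'medical_cause_of_death': ['BLUNT FORCE TRAUMA - MOTOR VEHICLE ACCIDENT',
                               'HEART DISEASE',
                               None],
    'manner_of_death_description': [None, None, 'AUTO-PEDESTRIAN COLLISION'],
})

# Hypothetical keyword list; word boundaries avoid matching inside longer words.
pattern = r'\bMOTOR VEHICLE\b|\bMVA\b|\bCOLLISION\b'

# Combine both text columns so a hit in either one counts.
text = (toy['medical_cause_of_death'].fillna('') + ' ' +
        toy['manner_of_death_description'].fillna(''))
is_motor_vehicle = text.str.contains(pattern, regex=True)
print(is_motor_vehicle.tolist())
```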

### [TODO] Remove `medical_examinor_coroner_evalution`? Not sure it's useful.

In [None]:
pd.crosstab(cdr.medical_examinor_coroner_evalution, cdr.form_version).sort_values(['V_2005'],ascending=False)

### [TODO] Somehow blend the `entry_behavior`/`other_behavior` fields from V_2005 with the `exhibit_any_medical_problems`, `exhibit_any_mental_health_problems` and `make_suicidal_statements` fields from V_2016

In [None]:
pd.crosstab(cdr.entry_behavior, cdr.form_version).sort_values(['V_2005'],ascending=False)

In [None]:
pd.crosstab(cdr.exhibit_any_medical_problems, cdr.form_version).sort_values(['V_2016'],ascending=False)

In [None]:
pd.crosstab(cdr.exhibit_any_mental_health_problems, cdr.form_version).sort_values(['V_2016'],ascending=False)

In [None]:
pd.crosstab(cdr.make_suicidal_statements, cdr.form_version).sort_values(['V_2016'],ascending=False)

### [TODO] Align various binary behavioral columns across form versions so we can keep them?

In [None]:
# These columns were completely removed. Some can stay once paired off
drop_cols