# Reformat old CDR master file for incorporation in new CDR dataset

### About the data

"Prior to 2013, the OIG only completed forms for unattended deaths (deaths that did not occur in an inpatient setting).  Beginning in January 2013, they began completing the form for all deaths."

[Here's](http://gritsforbreakfast.blogspot.com/2014/12/tdcj-reporting-change-explains-death-in.html) a blog post about this change.

Our CDR data, obtained via email from the Texas AG, does not include the pre-2013 deaths for which no form was completed. For the previous version of the TJI website (2017), this missing data was obtained from the Texas Bureau of Justice Statistics (BJS), merged with the CDR data, and placed in a "master file" that the old website used.

This notebook reformats this old master file to be incorporated into our existing CDR data cleaning pipeline.

See the `clean_cdr.ipynb` file for how the output of this notebook is incorporated.

### Datasets used

Input:
* `tji/raw-and-processing/original/cdr_2017_master_file.csv`

Output:
* `tji/raw-and-processing/reformatted_cdr_2017_master_file.csv`
  
##### Author: Everett Wetchler (everett.wetchler@gmail.com)

In [1]:
DTW_PROJECT_KEY = 'tji/raw-and-processing'
RAW_FILENAME = 'original/cdr_2017_master_file.csv'
CLEANED_FILENAME = 'reformatted_cdr_2017_master_file.csv'

In [2]:
import datadotworld as dw
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

%load_ext watermark
%watermark -a 'Everett Wetchler' -d -t -z -w -p numpy,pandas,matplotlib,datadotworld

Everett Wetchler 2018-05-28 23:15:53 CDT

numpy 1.14.3
pandas 0.22.0
matplotlib 2.2.0
datadotworld 1.6.0
watermark 1.6.0


In [3]:
from lib.cleaning_tools import *

In [4]:
df = read_dtw_csv(DTW_PROJECT_KEY, RAW_FILENAME, encoding='latin-1')
print(df.shape)
df.head()

Writing CSV to temp file: /var/folders/dc/8cbxbsh515s908xl0zyprszm0000gn/T/tmp2oqj9cve
(7735, 94)


Unnamed: 0,report_date,pa_number,department_type,department_name,department_address,department_city,department_state,department_zip,department_phone,department_county,director_envelope,director_salutation,director_first_name,director_middle_name,director_last_name,reporter_name,reporter_email,first_name,middle_name,last_name,suffix,ethnicity,ethnicity_other,sex,birth_date,age,custody_date,custody_date_na,death_date,death_address,death_city,death_county,medical_evaluation,manner_death,manner_death_description,medical_cause_death,medical_condition,medical_treatment,medical_treatment_description,custody_general,custody_type,custody_type_facility,charges_1,charges_2,charges_3,charges_status,charges_violent,charges_child,charges_property,charges_drug,charges_other,charges_other_description,death_reason,injured_by_officers,injured_by_others,injured_by_self_accident,injured_by_suicide,injured_by_unknown,injured_by_na,under_restraint,restraint_handcuffs,restraint_shackles,restraint_other,restraint_other_description,appear_intoxicated,threaten_officers,resist_arrest,attempt_escape,fight_officers,use_weapon_officers,use_weapon_officers_description,behavior_other,behavior_other_description,behavior_na,death_weapon_handgun,death_weapon_rifle,death_weapon_baton,death_weapon_tazer,death_weapon_other,death_weapon_other_description,death_weapon_na,death_location,death_location_elsewhere,entry_date,entry_date_na,entry_behavior_intoxicated,entry_behavior_mental,entry_behavior_medical,entry_behavior_na,death_causer,death_causer_other,death_means,death_means_other,Summary
0,3/1/2005 0:00,PRISON20051001,STAGENCY,Texas Department Of Criminal Justice,,,TX,,,,,,,,,,,Clyde,L,Pady,,Anglo or White,,Male,10/1/2024,80.0,5/30/1995 0:00,,1/2/2005,,Tenessee Colony,,No,Natural Causes/Illness,Prostate/Lung Cancer,,Developed condition after admission,Yes,,Penitentiary,TDCJ,Michael,SEXUAL ASSLT/RAPE OF CHILD 11190000,SEXUAL ASSLT/RAPE OF CHILD 11190000,SEXUAL ASSLT/RAPE OF CHILD 11190000,Convicted,0.0,1.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,3/1/2005 0:00,PRISON20051002,STAGENCY,Texas Department Of Criminal Justice,,,TX,,,,,,,,,,,Winston,,Wilkins,,Anglo or White,,Male,9/27/1943,61.0,9/8/2003 0:00,,1/2/2005,,Galveston,,No,Natural Causes/Illness,Lung Cancer,,Pre-existing medical condition,Yes,,Penitentiary,TDCJ,Hospital Galveston,"STOLEN VEHICLE, UNAUTHORIZED USE 24110000",0,0,Convicted,0.0,0.0,1.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,3/1/2005 0:00,PRISON20051004,STAGENCY,Texas Department Of Criminal Justice,,,TX,,,,,,,,,,,Steven,,Fleming,,Black or African American,,Male,10/30/1967,37.0,10/7/1988 0:00,,1/4/2005,,Livingston,,No,Natural Causes/Illness,,,Pre-existing medical condition,Yes,,Penitentiary,TDCJ,Polunsky,SEXUAL ASSLT/RAPE AGG/SEX ABUSE 11180000,0,0,Convicted,1.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,3/1/2005 0:00,PRISON20051005,STAGENCY,Texas Department Of Criminal Justice,,,TX,,,,,,,,,,,James,B,Walker,,Black or African American,,Male,8/27/1943,61.0,4/23/2004 0:00,,1/4/2005,,Tenessee Colony,,No,Natural Causes/Illness,Hepatocellular Cancer,,Developed condition after admission,Yes,,Penitentiary,TDCJ,Michael,AGG SEXUAL ASSLT 11990003,AGG KIDNAPPING 10990004,0,Convicted,1.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,3/1/2005 0:00,PRISON20051007,STAGENCY,Texas Department Of Criminal Justice,,,TX,,,,,,,,,,,David,L,Jones,,Anglo or White,,Male,5/1/1960,45.0,11/8/2001 0:00,,1/6/2005,,Abilene,,No,Natural Causes/Illness,Cardiomyopathy Heart Failure,,Pre-existing medical condition,Yes,,Penitentiary,TDCJ,Robertson,INDECENCY W/ A CHILD SEXUAL CONTACT 36010001,0,0,Convicted,0.0,1.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


## Collapse binary "entry behavior" and "charges" columns into single string columns

In [5]:
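# Build one space-separated string of the flags set (== 1.0) in each row.
df['entry_behavior'] = ''
for suffix in ('intoxicated', 'medical', 'mental'):
    c = 'entry_behavior_%s' % suffix
    s = df[c].apply(lambda x: suffix if x == 1.0 else '')
    df['entry_behavior'] = df['entry_behavior'] + ' ' + s

# Re-join the flags with '; ' (safe because no suffix contains a space); rows with no flags become None.
df['entry_behavior'] = df['entry_behavior'].apply(lambda x: ('; '.join(x.strip().split())) if x.strip() else None)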
df['entry_behavior'].value_counts()

medical                         322
intoxicated                     192
mental                           84
medical; mental                  40
intoxicated; medical             29
intoxicated; mental              19
intoxicated; medical; mental     13
Name: entry_behavior, dtype: int64

In [6]:
# Same approach as above, applied to the binary charge-category columns.
df['charges'] = ''
for suffix in ('violent', 'child', 'property', 'drug', 'other'):
    c = 'charges_%s' % suffix
    s = df[c].apply(lambda x: suffix if x == 1.0 else '')
    df['charges'] = df['charges'] + ' ' + s

# Rows with no charge category set become None.
df['charges'] = df['charges'].apply(lambda x: ('; '.join(x.strip().split())) if x.strip() else None)
df['charges'].value_counts()

violent                           2808
drug                              1244
other                              890
property                           817
child                              693
violent; child                     107
violent; property                  101
violent; drug                       76
drug; other                         55
violent; other                      27
property; drug                      26
violent; property; drug             13
property; other                     12
violent; drug; other                 7
child; other                         5
child; drug                          5
violent; property; other             4
property; drug; other                3
child; property                      2
violent; child; property             2
violent; child; drug                 1
violent; child; other                1
child; property; drug                1
violent; property; drug; other       1
Name: charges, dtype: int64
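
The two cells above repeat the same pattern for `entry_behavior_*` and `charges_*`. As a minimal sketch (not the notebook's actual code), the pattern could be factored into a single helper. The name `collapse_flags` and the toy DataFrame are hypothetical and exist only for illustration.

```python
import pandas as pd


def collapse_flags(frame, prefix, suffixes, sep='; '):
    """Hypothetical helper: join the suffixes whose '<prefix>_<suffix>' column equals 1.

    Rows with no flag set (including missing values) get None.
    """
    # One boolean Series per flag column; NaN compares unequal to 1.0, so it counts as unset.
    flags = {s: frame['%s_%s' % (prefix, s)] == 1.0 for s in suffixes}
    out = []
    for i in range(len(frame)):
        labels = [s for s in suffixes if flags[s].iloc[i]]
        out.append(sep.join(labels) if labels else None)
    return pd.Series(out, index=frame.index, dtype=object)


# Toy data (made up for this example, not taken from the master file).
toy = pd.DataFrame({
    'charges_violent': [1.0, 0.0, 1.0, None],
    'charges_drug': [0.0, 1.0, 1.0, None],
})
toy['charges'] = collapse_flags(toy, 'charges', ('violent', 'drug'))
print(toy['charges'].tolist())
```

Labels are emitted in the order given in `suffixes`, which matches how the loops above build their strings.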

## Remove numerical charge code where present

In [7]:
df.charges_1.value_counts().head()

Murder                                            296
Aggravated Sexual Assault of a Child              166
HOMICIDE-MURDER 09150000                          153
DRIVING WHILE INTOXICATED 3RD OR MORE 54040011    145
SEXUAL ASSLT/RAPE AGG/SEX ABUSE 11180000          138
Name: charges_1, dtype: int64

In [8]:
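def remove_code(charge):
    """Strip a trailing numeric offense code (6+ digits) from a charge string."""
    if pd.isnull(charge):
        return None
    parts = charge.split()
    # Guard against empty or whitespace-only strings, where parts[-1] would raise IndexError.
    if parts and parts[-1].isdigit() and len(parts[-1]) > 5:
        return ' '.join(parts[:-1])
    return charge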

for i in range(1, 4):
    col = 'charges_%d' % i
    df[col] = df[col].apply(remove_code)

df.charges_1.value_counts().head()

Murder                                   296
Aggravated Sexual Assault of a Child     166
HOMICIDE-MURDER                          153
DRIVING WHILE INTOXICATED 3RD OR MORE    145
SEXUAL ASSLT/RAPE AGG/SEX ABUSE          138
Name: charges_1, dtype: int64
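
As an alternative sketch, the same stripping can be expressed as one vectorized regex replacement. This assumes a pandas release whose `Series.str.replace` accepts the `regex` keyword (newer than the 0.22.0 shown in the watermark above), and the sample values are illustrative only. It is not identical to `remove_code`: a value consisting solely of a code is left untouched here, and internal whitespace is not normalized.

```python
import pandas as pd

# Illustrative values in the style of the charges_1 column.
charges = pd.Series([
    'HOMICIDE-MURDER 09150000',
    'DRIVING WHILE INTOXICATED 3RD OR MORE 54040011',
    'Murder',
    '0',
    None,
])

# Remove a final whitespace-separated token made of six or more digits.
stripped = charges.str.replace(r'\s+\d{6,}$', '', regex=True)
print(stripped.tolist())
```

Short numeric placeholders such as `0` survive because they have fewer than six digits, and missing values stay missing.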

## Rename columns to align with the other CDR dataset

In [9]:
col_renames = {
    'first_name': None,
    'middle_name': None,
    'last_name': None,
    'suffix': None,
    'age': 'age_at_time_of_death',
    'sex': 'sex',
    'birth_date': 'date_of_birth',
    'charges_1': 'offense_1',
    'charges_2': 'offense_2',
    'charges_3': 'offense_3',
    'report_date': None,
    'death_date': 'death_date_and_time',
    'pa_number': 'cdr_cdr_name',
    'department_name': 'agency_name',
    'department_address': 'agency_address',
    'department_city': 'agency_city',
    'department_zip': 'agency_zip',
    'department_county': 'agency_county',
    'death_causer': 'who_caused_the_death',
    'death_causer_other': None,
    'custody_date': 'date_time_of_custody_or_incident',
    'custody_date_na': None,
    'custody_general': 'type_of_custody',
    'custody_type': 'specific_type_of_custody_facility',
    'death_address': 'street_address',
    'death_city': 'city',
    'death_county': 'county',
    'death_location': None,
    'death_location_elsewhere': None,
    'death_means': 'means_of_death',
    'death_means_other': 'means_of_death_other',
    'manner_death': 'manner_of_death',
    'manner_death_description': 'manner_of_death_description',
    'entry_date': 'entry_date_time',
    'entry_date_na': 'entry_date_time_n_a',
    'medical_cause_death': 'medical_cause_of_death',
    'medical_condition': 'pre_existing_medical_condition',
    'medical_evaluation': 'medical_examinor_coroner_evalution',
    'medical_treatment': None,
    'ethnicity': None,
    'ethnicity_other': None,
    'behavior_other_description': 'other_behavior',
    'entry_behavior': None,
    'charges_status': 'were_the_charges',
    'charges': 'type_of_offense',
    'charges_other_description': 'type_of_offense_other',
}

In [10]:
# Keep only the columns listed in col_renames; every other column is dropped
df = df[list(col_renames.keys())]
# Rename (if the rename value is None, keep the original name)
df.columns = [col_renames[c] or c for c in df.columns]
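
The select-then-rename step can also be written with `DataFrame.rename`, with an explicit check that every expected column is present. The following is a sketch on a made-up three-entry mapping and toy DataFrame, not the full `col_renames` dictionary.

```python
import pandas as pd

# Shortened, illustrative mapping: None means "keep the original name".
renames = {
    'first_name': None,
    'age': 'age_at_time_of_death',
    'charges_1': 'offense_1',
}

toy = pd.DataFrame({
    'first_name': ['A'],
    'age': [61.0],
    'charges_1': ['Murder'],
    'director_salutation': [None],  # not in the mapping, so it is dropped
})

# Fail early with a readable message if an expected column is absent.
missing = [c for c in renames if c not in toy.columns]
if missing:
    raise KeyError('Columns missing from input: %s' % missing)

# Only pass real renames to DataFrame.rename; None entries keep their names.
explicit = {old: new for old, new in renames.items() if new is not None}
result = toy[list(renames)].rename(columns=explicit)
print(list(result.columns))
```

Column order follows the order of keys in the mapping, as in the cell above.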

In [11]:
df.head()

Unnamed: 0,first_name,middle_name,last_name,suffix,age_at_time_of_death,sex,date_of_birth,offense_1,offense_2,offense_3,report_date,death_date_and_time,cdr_cdr_name,agency_name,agency_address,agency_city,agency_zip,agency_county,who_caused_the_death,death_causer_other,date_time_of_custody_or_incident,custody_date_na,type_of_custody,specific_type_of_custody_facility,street_address,city,county,death_location,death_location_elsewhere,means_of_death,means_of_death_other,manner_of_death,manner_of_death_description,entry_date_time,entry_date_time_n_a,medical_cause_of_death,pre_existing_medical_condition,medical_examinor_coroner_evalution,medical_treatment,ethnicity,ethnicity_other,other_behavior,entry_behavior,were_the_charges,type_of_offense,type_of_offense_other
0,Clyde,L,Pady,,80.0,Male,10/1/2024,SEXUAL ASSLT/RAPE OF CHILD,SEXUAL ASSLT/RAPE OF CHILD,SEXUAL ASSLT/RAPE OF CHILD,3/1/2005 0:00,1/2/2005,PRISON20051001,Texas Department Of Criminal Justice,,,,,,,5/30/1995 0:00,,Penitentiary,TDCJ,,Tenessee Colony,,,,,,Natural Causes/Illness,Prostate/Lung Cancer,,,,Developed condition after admission,No,Yes,Anglo or White,,,,Convicted,child,
1,Winston,,Wilkins,,61.0,Male,9/27/1943,"STOLEN VEHICLE, UNAUTHORIZED USE",0,0,3/1/2005 0:00,1/2/2005,PRISON20051002,Texas Department Of Criminal Justice,,,,,,,9/8/2003 0:00,,Penitentiary,TDCJ,,Galveston,,,,,,Natural Causes/Illness,Lung Cancer,,,,Pre-existing medical condition,No,Yes,Anglo or White,,,,Convicted,property,
2,Steven,,Fleming,,37.0,Male,10/30/1967,SEXUAL ASSLT/RAPE AGG/SEX ABUSE,0,0,3/1/2005 0:00,1/4/2005,PRISON20051004,Texas Department Of Criminal Justice,,,,,,,10/7/1988 0:00,,Penitentiary,TDCJ,,Livingston,,,,,,Natural Causes/Illness,,,,,Pre-existing medical condition,No,Yes,Black or African American,,,,Convicted,violent,
3,James,B,Walker,,61.0,Male,8/27/1943,AGG SEXUAL ASSLT,AGG KIDNAPPING,0,3/1/2005 0:00,1/4/2005,PRISON20051005,Texas Department Of Criminal Justice,,,,,,,4/23/2004 0:00,,Penitentiary,TDCJ,,Tenessee Colony,,,,,,Natural Causes/Illness,Hepatocellular Cancer,,,,Developed condition after admission,No,Yes,Black or African American,,,,Convicted,violent,
4,David,L,Jones,,45.0,Male,5/1/1960,INDECENCY W/ A CHILD SEXUAL CONTACT,0,0,3/1/2005 0:00,1/6/2005,PRISON20051007,Texas Department Of Criminal Justice,,,,,,,11/8/2001 0:00,,Penitentiary,TDCJ,,Abilene,,,,,,Natural Causes/Illness,Cardiomyopathy Heart Failure,,,,Pre-existing medical condition,No,Yes,Anglo or White,,,,Convicted,child,


## Write

In [12]:
with dw.open_remote_file(DTW_PROJECT_KEY, CLEANED_FILENAME) as w:
    print("Writing to data.world:", CLEANED_FILENAME)
    df.to_csv(w, index=False)

Writing to data.world: reformatted_cdr_2017_master_file.csv
