# Clean and reformat CDR data from its multi-tab Excel file into a single CSV

### About the data

CDR data is tricky -- the form used by law enforcement has changed over time, first in 2005, then again in 2016. The data before 2005 is known to be sparse and reporting was poorly enforced, so we ignore those entries. The 2005 and 2016 versions of the form have some overlap and some differences, so we must be careful in how we merge them.

Further, some older deaths that happened in an inpatient setting are not reported in this dataset. These records were requested separately for the first version of the TJI website. We have an "old master file" from that first website that includes these deaths, but also many others. We need to load this file and include any records that we don't yet have.

**In this repo you can find blank versions of the [2005](https://github.com/texas-justice-initiative/data-processing/blob/master/forms/CDR%20Form%20Version%202005.pdf) and [2016](https://github.com/texas-justice-initiative/data-processing/blob/master/forms/CDR%20Form%20Version%202016.pdf) forms, to see for yourself exactly what fields are collected and how.**

### Datasets used


* Input:
  * `tji/raw-and-processing/CDR - All Reports.xlsx`
  * `tji/raw-and-processing/reformatted_cdr_2017_master_file`
  * `tji/auxiliary-datasets/agencies_and_counties`
* Output:
  * `tji/deaths-in-custody/cleaned_custodial_death_reports.csv`
  
##### Author: Everett Wetchler (everett.wetchler@gmail.com)

## STEPS

**1. Setup and read data**
- 1a. Configuration and imports
    - Constants
    - Libraries
- 1b. Read data
    - Drop records with death date pre-2005
- 1c. Helper methods/code
    

**2. Join disparate datasets, rename and reformat columns**
- 2a. Reduce columns used and rename
    - Drop columns that we either don't need or haven't written code to clean yet
    - Rename columns to be more machine friendly
- 2b. Add rows from the old master file to get BJS inpatient deaths 
    - Merge the "old master" file with the other CDR records
- 2c. Column type conversions
    - Convert date columns
    - Convert float columns from string
    - Upcase string columns
    - Add `death_date` column
    
**3. Begin Deep Cleaning**
- 3a. Standardize `race` values
- 3b. Standardize agency information
- 3c. Fix death-information related columns
    - Fix `death_location`
    - Fix `means_of_death`
    - Fix `manner_of_death` and `manner_of_death_description`
    - Fix `pre_existing_medical_condition`
    - Fix `who_caused_the_death`
    - Fix `medical_examinor_coroner_evalution`
- 3d. Fix other columns one-by-one
    - Standardize `gender`
    - Fix `were_the_charges`
    - Fix `type_of_custody`
    - Fix `specific_type_of_custody_facility`
- 3e. Identify and drop a range of unnecessary columns
    - `entry_date_time_n_a` and `custody_date_na`
    - `department_type`

**4. Deduplication**
- 4a. Whole row duplicates
- 4b. Define some dedup functions
- 4c. Dedup
- 4d. Inspect results

**5. Add new columns, rename/reorder, otherwise finalize dataframe**
- 5a. New columns and bookkeeping columns
    - Column for the time (in days) between incarceration (or incident) and death
- 5b. Rename and reorder columns sensibly
- 5c. Final look before writing


**6. Write**


**7. TODOs -- problems that still exist in the data and need work**

---
Let's begin...

<div style="background-color: #080;height: 20px"></div>

# 1. Setup and read data

<div style="background-color: #008;height: 20px"></div>

## 1a. Configuration and imports

In [1]:
# Constants so we don't have to hunt through the code to tweak these in many places

DW_PROJECT_CDR = 'tji/deaths-in-custody'
DW_PROJECT_OIS = 'tji/officer-involved-shootings'
DW_PROJECT_AUXILIARY_DATASETS = 'tji/auxiliary-datasets'
DW_PROJECT_RAW_AND_PROCESSING = 'tji/raw-and-processing'

OUTPUT_DW_PROJECT = DW_PROJECT_CDR
OUTPUT_FILENAME = 'cleaned_custodial_death_reports.csv'

In [2]:
# Import ALL the things

import os
import sys
import json
import boto3
import datetime
import numpy as np
import pandas as pd
import datadotworld as dw
import pygsheets
import getpass

from io import StringIO
from lib.cleaning_tools import *

sys.path.append(os.getcwd() + '/../data_cleaning')

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

%load_ext watermark
%watermark -a "Everett Wetchler, Aiden Yang" -d -t -z -w -p numpy,pandas,datadotworld,pygsheets

Everett Wetchler, Aiden Yang 2021-01-03 09:53:36 MST 

numpy 1.17.2
pandas 1.1.2
datadotworld 1.7.0
pygsheets 2.0.3.1
watermark 2.0.2


In [3]:
# Custom libraries specific to this project
from lib.standardize_police_agency_names import standardize_agency_name

<div style="background-color: #008;height: 20px"></div>

## 1b. Read data and take a glance

In [4]:
datasets = dw.load_dataset(DW_PROJECT_AUXILIARY_DATASETS, force_update=True)
agency_county = datasets.dataframes['agencies_and_counties']
agency_county = agency_county.set_index('agency')['county'].to_dict()

In [5]:
if getpass.getuser() == 'hlukas':
    #gc = pygsheets.authorize('/Users/hlukas/Downloads/client_secret_lghhager.json')
    gc = pygsheets.authorize('/Users/hlukas/Downloads/online_service_file.json')
else:
    gc = pygsheets.authorize(service_file = '/home/ec2-user/data-processing/automation/client_secret.json')

In [6]:
#gc.drive.enable_team_drive('0ACeQWapAwOLqUk9PVA')

titles = gc.spreadsheet_titles()
ids = gc.spreadsheet_ids()

cdr_id = [ids[index] for index in range(len(titles)) if titles[index] == 'CDR Reports All'][0]
output = gc.open_by_key(cdr_id)

dfs = {}

# drop empty columns or the concatenation does not work
for sheet_name in ['Form Version 2005', 'Form Version 2016', 'Older Forms']:
    sheet = [i for i in range(4) if output.worksheets()[i].title == sheet_name][0]
    wk = output[sheet]
    try:
        dfs[sheet_name] = wk.get_as_df(include_tailing_empty = True).drop(columns = [''], axis= 1)
    except KeyError:
        dfs[sheet_name] = wk.get_as_df(include_tailing_empty = True)
    
dfs['Form Version 2005']['form_version'] = 'V_2005'
dfs['Form Version 2016']['form_version'] = 'V_2016'
dfs['Older Forms']['form_version'] = 'V_OLDER'

cdr = pd.concat([dfs['Form Version 2005'], dfs['Form Version 2016']])
# Sum over the DataFrames (dfs.values()); iterating over dfs itself yields the sheet-name strings
print('Read %d raw CDR records. Ignoring %d from older form versions (pre-2005).' % (
    sum(len(x) for x in dfs.values()), len(dfs['Older Forms'])))
print('Keeping %d records using form version 2005, and %d using version 2016 (keeping %d in total)' % (
    len(dfs['Form Version 2005']), len(dfs['Form Version 2016']), len(cdr)))

Read 15388 raw CDR records. Ignoring 5468 from older form versions (pre-2005).
Keeping 5156 records using form version 2005, and 4764 using version 2016 (keeping 9920 in total)


In [None]:
# datasets = dw.load_dataset(DW_PROJECT_AUXILIARY_DATASETS, force_update=True)
# agency_county = datasets.dataframes['agencies_and_counties']
# agency_county = agency_county.set_index('agency')['county'].to_dict()

In [None]:
# dfs = read_dtw_excel(DW_PROJECT_RAW_AND_PROCESSING, 'original/CDR Reports All.xlsx')
# dfs['Form Version 2005']['form_version'] = 'V_2005'
# dfs['Form Version 2016']['form_version'] = 'V_2016'
# dfs['Older Forms']['form_version'] = 'V_OLDER'
# cdr = pd.concat([dfs['Form Version 2005'], dfs['Form Version 2016']])
# print('Read %d raw CDR records. Ignoring %d from older form versions (pre-2005).' % (
#     sum(len(x) for x in dfs), len(dfs['Older Forms'])))
# print('Keeping %d records using form version 2005, and %d using version 2016 (keeping %d in total)' % (
#     len(dfs['Form Version 2005']), len(dfs['Form Version 2016']), len(cdr)))

### In theory, all these records should be for deaths in 2005 or later. Let's double check and drop any miscreants.

In [None]:
assert cdr['Death Date and Time'].isnull().sum() == 0

In [None]:
cdr['Death Date and Time'] = pd.to_datetime(cdr['Death Date and Time'])


In [None]:
before = len(cdr)
cdr = cdr[cdr['Death Date and Time'].dt.year >= 2005]
after = len(cdr)
print('Dropped %d (of %d) reports for deaths before 2005, leaving %d' % (before - after, before, after))

In [None]:
print("Data goes from %s to %s" % (cdr['Death Date and Time'].min().strftime("%Y-%m-%d"),
                                   cdr['Death Date and Time'].max().strftime("%Y-%m-%d")))

### A quick look at the raw data

In [None]:
dfs['Form Version 2005'].head()

In [None]:
dfs['Form Version 2016'].head()

<div style="background-color: #008;height: 20px"></div>

## 1c. Helper methods/code

In [None]:
def show_notnull_chart(cdr, vertical=False):
    '''Show which columns have null values, how often, and break down by form version.'''
    versions = sorted(set(cdr.form_version))
    frames = []
    for v in versions:
        fr = cdr[cdr.form_version == v]
        s = fr.notnull().mean()
        frames.append(s)
    frame = pd.concat(frames, axis=1)
    frame.columns = versions
    frame.sort_index(inplace=True)
    if not vertical:
        frame = frame.T
    return frame.style.background_gradient(cmap='RdYlGn', axis=(0 if vertical else 1))
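
The chart above is styled for notebook display. As a rough sketch of the numbers behind it, the toy frame below (column values are made up for illustration) computes the same per-version not-null fractions with a `groupby` instead of an explicit loop.

```python
import pandas as pd

# Toy data: the values are invented purely for illustration.
toy = pd.DataFrame({
    'form_version': ['V_2005', 'V_2005', 'V_2016', 'V_2016'],
    'ethnicity': ['WHITE', 'BLACK', None, None],
    'race': [None, None, 'HISPANIC', None],
})

# Fraction of non-null cells per column, one row per form version.
notnull_by_version = (
    toy.drop(columns='form_version')
       .notnull()
       .groupby(toy['form_version'])
       .mean()
)
print(notnull_by_version)
```

Columns that only one form version collects show up as 0.0 for the other version, which is the pattern to look for in the real chart.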

In [None]:
OTHER_SPECIFY = 'OTHER, SPECIFY'

<div style="background-color: #080;height: 20px"></div>

# 2. Join disparate datasets, rename and reformat columns

<div style="background-color: #008;height: 15px"></div>

## 2a. Reduce the number of columns that we keep, and rename them more conveniently

#### Look at how often our columns are populated, broken down by form version (since each ask slightly different questions)

In [None]:
show_notnull_chart(cdr)

### Drop columns that we either don't need or haven't written code to clean yet.

In [None]:
keep_text = '''Both forms

- Age At Time Of Death
- Agency Address
- Agency City
- Agency Name
- Agency Zip
- CDR: CDR Name
- City
- County
- Date of Birth
- Date/Time of Custody or Incident
- Death Date and Time
- Death Location
- Death Location Elsewhere
- Entry Date Time
- Entry Date Time N/A
- First Name
- Middle Name
- Last Name
- Suffix
- Manner of Death
- Manner of Death Description
- Means of Death
- Means of Death Other
- Medical Cause of Death
- Medical Examinor/Coroner Evalution?
- Medical Treatment
- Offense 1
- Offense 2
- Offense 3
- Pre existing medical condition?
- Report Date
- Sex
- Specific Type of Custody/Facility
- Street Address
- Type of Custody
- Type of Offense
- Type of Offense, Other
- Version Number
- Version Type
- Were the Charges:
- Who caused the death?
- form_version
- Type of Restraint
- Under Restraint

2005 form only

- Agency County
- Custody Date NA
- Death Causer Other
- Department Type
- Entry Behavior
- Ethnicity
- Ethnicity Other
- Other Behavior
- Specify Other Behavior

2016 form only

- Exhibit any medical problems?
- Exhibit any mental health problems?
- Make suicidal statements?
- Race'''
keep_cols = []
drop_cols = set(cdr.columns)
for line in keep_text.splitlines():
    if line.startswith('- '):
        colname = line[2:]
        keep_cols.append(colname)
        drop_cols.remove(colname)

### Rename columns to be more machine friendly (lowercase, snake_case, and remove non-alphanumeric characters)

In [None]:
col_renames = {}
for c in keep_cols:
    new_name = ''.join([ch if ch.isalnum() else ' ' for ch in c.lower()])
    new_name = '_'.join(new_name.strip().split())
    col_renames[c] = new_name
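
To make the renaming rule concrete, here is the same transformation wrapped in a small function and applied to three of the column names from the keep list. The function name `to_snake_case` is a hypothetical helper introduced only for this illustration.

```python
def to_snake_case(name):
    """Lowercase, turn every non-alphanumeric character into a space, join words with '_'."""
    cleaned = ''.join(ch if ch.isalnum() else ' ' for ch in name.lower())
    return '_'.join(cleaned.split())

examples = ['CDR: CDR Name', 'Entry Date Time N/A', 'Were the Charges:']
renamed = {c: to_snake_case(c) for c in examples}
for old, new in renamed.items():
    print('%-22s -> %s' % (old, new))
```

This is why the record id column appears later in the notebook as `cdr_cdr_name`, and why the trailing colon and question marks disappear from the column names.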

In [None]:
cdr = cdr[list(col_renames.keys())]
cdr.rename(columns=col_renames, inplace=True)
cdr.head()

<div style="background-color: #008;height: 15px"></div>

## 2b. Add rows from the old master file to get BJS inpatient deaths (will dedup later)

In [None]:
# gc.drive.enable_team_drive('0ACeQWapAwOLqUk9PVA')
# titles = gc.spreadsheet_titles()
# ids = gc.spreadsheet_ids()

# cdr_id = [ids[index] for index in range(len(titles)) if titles[index] == 'reformatted_cdr_2017_master_file'][0]
# output = gc.open_by_key(cdr_id)

# wk = output[0]

# dfs = {}

# # drop empty columns or the concatenation does not work
# for sheet_name in ['Form Version 2005', 'Form Version 2016', 'Older Forms']:
#     sheet = [i for i in range(4) if output.worksheets()[i].title == sheet_name][0]
#     wk = output[sheet]
#     try:
#         dfs[sheet_name] = wk.get_as_df(include_tailing_empty = True).drop(columns = [''], axis= 1)
#     except KeyError:
#         dfs[sheet_name] = wk.get_as_df(include_tailing_empty = True)

In [None]:
datasets = dw.load_dataset(DW_PROJECT_RAW_AND_PROCESSING, force_update=True)
old_master = datasets.dataframes['reformatted_cdr_2017_master_file']
old_master['form_version'] = 'V_BJS'
print(old_master.shape)
old_master.head()

In [None]:
cdr.shape, old_master.shape

#### There will be a few columns that the old master file doesn't have, which is fine

In [None]:
set(cdr.columns) - set(old_master.columns)

#### However, it shouldn't have any novel columns that the *other* data doesn't have

In [None]:
assert(len(set(old_master.columns) - set(cdr.columns)) == 0)

### Merge the "old master" file with the other CDR records

In [None]:
cdr = pd.concat([cdr, old_master])
cdr.reset_index(inplace=True, drop=True)
cdr.sort_values('form_version', inplace=True)
cdr.head()
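
A minimal sketch of what `pd.concat` does when one frame lacks some columns, using two invented single-row frames as stand-ins for `cdr` and `old_master`: the missing cells are filled with NaN rather than raising an error.

```python
import pandas as pd

# Toy stand-ins for `cdr` and `old_master`; names and values are invented.
current = pd.DataFrame({'first_name': ['ANN'], 'race': ['WHITE'], 'form_version': ['V_2016']})
legacy = pd.DataFrame({'first_name': ['BOB'], 'form_version': ['V_BJS']})

# Columns are aligned by name; `legacy` has no 'race', so that cell becomes NaN.
merged = pd.concat([current, legacy]).reset_index(drop=True)
print(merged)
```

This is why it is fine for the old master file to be missing a few columns, while novel columns in it would silently add mostly-empty columns to the merged frame (hence the assertion above).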

<div style="background-color: #008;height: 15px"></div>

## 2c. Data is merged into one frame. Now do column type conversions

### Convert date columns

In [None]:
s1 = cdr.dtypes
convert_date_cols(cdr)
s2 = cdr.dtypes
different = s1[s1 != s2].index.tolist()
print("Changed %d cols to datetime (from some other dtype):" % len(different), different)

### As shown below, some float columns are in string format. We'll fix that next.

In [None]:
cdr.entry_date_time_n_a.value_counts().head()

In [None]:
def float_or_nan(val):
    '''Convert val to float; log the value and return NaN if it cannot be converted.'''
    try:
        return float(val)
    except (ValueError, TypeError):
        print("- BAD VALUE (returning NaN):", val)
        return np.nan

float_cols = [
 'age_at_time_of_death',
 'agency_zip',
 'custody_date_na',
 'entry_date_time_n_a',
 'version_number'
]

for c in float_cols:
    print("Converting", c)
    cdr[c] = cdr[c].apply(float_or_nan).astype(float)
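
For comparison, pandas offers a built-in route to a similar result. The sketch below uses `pd.to_numeric` with `errors='coerce'` on an invented series; unlike `float_or_nan`, it does not print the values it could not convert, so it is an alternative rather than a drop-in replacement.

```python
import pandas as pd

# Invented sample of the kinds of strings that can show up in a numeric column.
raw = pd.Series(['34', '7.5', '', 'N/A', None])

# Unparseable entries become NaN instead of raising.
converted = pd.to_numeric(raw, errors='coerce')
print(converted)
```

The explicit helper in the notebook is arguably the better choice while the data is still being explored, since seeing the bad values is useful.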

### Upcase string cell contents

In [None]:
upcase_strip_string_cells(cdr)

### Add a `death_date` column (without the death time)

In [None]:
cdr['death_date'] = pd.to_datetime(cdr.death_date_and_time.apply(lambda dt: datetime.date(dt.year, dt.month, dt.day)))
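
As a sketch of an alternative way to strip the time component, assuming the column already has a datetime dtype, `Series.dt.normalize()` sets every timestamp to midnight and leaves missing values as `NaT`. The timestamps below are invented.

```python
import pandas as pd

# Invented timestamps, including one missing value.
stamps = pd.Series(pd.to_datetime(['2016-03-04 13:45:00', None]))

# Midnight-normalized copy: the date is kept, the time-of-day is dropped.
dates_only = stamps.dt.normalize()
print(dates_only)
```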

In [None]:
cdr_ready_to_clean = cdr.copy()

<div style="background-color: #080;height: 20px"></div>

# 3. Begin deep cleaning

In [None]:
# When we're working on modifying this file, and we need to reset
# the dataset, it's annoying to run the whole notebook, fetch from
# data.world, etc. So you can just rerun from here downward.
cdr = cdr_ready_to_clean.copy()

<div style="background-color: #008;height: 15px"></div>

## 3a.  Standardize `race` values

### First we need to merge race and ethnicity columns (the 2005 form calls it 'ethnicity', the 2016 'race')

#### Have a look at the values first

In [None]:
cdr.race.value_counts()

In [None]:
cdr.ethnicity.value_counts()

In [None]:
# Collapse the various OTHER-like values
cdr.loc[cdr.ethnicity.fillna('').str.contains('OTHER'), 'ethnicity'] = 'OTHER'

In [None]:
# When choosing the 'Other' ethnicity in the 2005 form version, there
# is a subsequent field to specify. Though clearly some of them are not
# truly 'other' ethnicities. See:
cdr.ethnicity_other.value_counts()

In [None]:
# Replace the nonsensical 0 with NaN
cdr.loc[cdr.ethnicity_other.astype(str) == '0', 'ethnicity_other'] = None

In [None]:
# Let's make sure nobody is filling out the "other ethnicity" column when they shouldn't...
cdr[(~(cdr.ethnicity == 'OTHER') & cdr.ethnicity_other.notnull())][['ethnicity', 'ethnicity_other']]

In [None]:
# Good. Let's transfer those specified ethnicity_other values into
# the 'ethnicity' column, so we can merge everything at once.
other_eth = (cdr.ethnicity == 'OTHER')
print('Merging %d "ethnicity_other" values into the main "ethnicity" column' % other_eth.sum())
cdr.loc[other_eth, 'ethnicity'] = cdr.ethnicity_other[other_eth]
cdr.drop('ethnicity_other', axis=1, inplace=True)

In [None]:
# Make a single 'race' column that has merged, simplified values of race or ethnicity.
race_eth_list = []
for race, eth in zip(cdr.race, cdr.ethnicity):
    # Only one of (race, eth) should be set
    assert pd.isnull(race) or pd.isnull(eth)
    if pd.isnull(race):
        if pd.isnull(eth):
            race_eth_list.append(None)
            continue
        x = eth
    else:
        x = race
    race_eth_list.append(x)

In [None]:
cdr['race'] = race_eth_list
cdr.drop('ethnicity', axis=1, inplace=True)
cdr.race.value_counts()
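
The row-by-row merge above can also be written in vectorized form. This is a minimal sketch on an invented three-row frame, keeping the same invariant (at most one of `race` and `ethnicity` is set per row) as an explicit check.

```python
import pandas as pd

toy = pd.DataFrame({
    'race': ['WHITE', None, None],
    'ethnicity': [None, 'HISPANIC', None],
})

# Same invariant as the loop: never both columns set on one row.
assert not (toy['race'].notnull() & toy['ethnicity'].notnull()).any()

# Take `race` where present, otherwise fall back to `ethnicity`.
merged_race = toy['race'].fillna(toy['ethnicity'])
print(merged_race.tolist())
```

Rows where both columns are empty stay empty, matching the `None` that the loop appends in that case.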

### Collapse `race` values into a smaller set (white/hispanic/black/other)

In [None]:
race_before = cdr.race.copy()
race_before.name = 'race_before'

In [None]:
standardize_race_cols(cdr)
cdr.race.value_counts()

In [None]:
cdr.groupby([race_before, 'race']).size()

<div style="background-color: #008;height: 15px"></div>

## 3b. Standardize agency information

In [None]:
# Standardize agency name (so we can join/compare across datasets)
cdr['agency_name'] = cdr['agency_name'].apply(standardize_agency_name)

# Lookup county name by agency name. If this fails, fall back
# on the county specified in the form, if it exists.
cdr['agency_county'] = cdr['agency_county'].str.upper()
county_lookup = cdr['agency_name'].apply(lambda name: agency_county.get(name, np.nan))
cdr['agency_county'] = county_lookup.fillna(cdr['agency_county'])

# Manually handle one major agency
cdr.loc[cdr['agency_name'] == 'TEXAS DEPT OF CRIMINAL JUSTICE', 'agency_county'] = 'STATE'
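
The lookup-with-fallback logic can be seen on a toy example. The agency names and counties below are invented, and `Series.map` is used in place of `apply` with `dict.get`; both give NaN for agencies missing from the lookup table.

```python
import pandas as pd

# Invented lookup table and records, for illustration only.
agency_county = {'EXAMPLE CITY POLICE DEPT': 'TRAVIS'}
toy = pd.DataFrame({
    'agency_name': ['EXAMPLE CITY POLICE DEPT', 'UNLISTED AGENCY'],
    'agency_county': ['harris', 'bexar'],
})

toy['agency_county'] = toy['agency_county'].str.upper()
lookup = toy['agency_name'].map(agency_county)   # NaN when the agency is not in the table
toy['agency_county'] = lookup.fillna(toy['agency_county'])
print(toy)
```

Note that the lookup wins whenever it succeeds: the first row ends up as TRAVIS even though the form said HARRIS, and only the unlisted agency keeps the county from the form.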

In [None]:
# Check that we are only missing counties for a paltry few records now.
cdr[cdr['agency_county'].isnull()]['agency_name'].value_counts()

In [None]:
# Clearly 'TEST CDR AGENCY' is meant to be ignored
test_agencies = cdr['agency_name'] == 'TEST CDR AGENCY'
cdr = cdr[~test_agencies]
print("Dropping %d records from 'TEST CDR AGENCY', leaving %d records" % (test_agencies.sum(), len(cdr)))

<div style="background-color: #008;height: 15px"></div>

## 3c.  Fix death-information related columns

#### The two form versions offer slightly different options (e.g. V_2005 uses 'AT MEDICAL FACILITY' while V_2016 uses 'MEDICAL FACILITY'). Collapse values to stabilize the options across form versions, and roll up rare values into a single 'OTHER' value.

### Fix: `death_location`

In [None]:
pd.crosstab(cdr.death_location, cdr.form_version)

#### The column now contains stray 0/1 values, which break the string replacements below unless we first coerce them to NaN

In [None]:
cdr['death_location'] = np.where(cdr['death_location'].isin([0,1]),
                                 np.nan,
                                 cdr['death_location'])

In [None]:
replacements = {
    'AT MEDICAL FACILITY': 'MEDICAL FACILITY',
    'AT LAW ENFORCEMENT FACILITY': 'LAW ENFORCEMENT FACILITY',
    'AT THE CRIME/ARREST SCENE': 'CRIME/ARREST SCENE',
    'SCENE OF INCIDENT': 'CRIME/ARREST SCENE',
    'LAW ENFORCEMENT FACILITY/BOOKING CENTER': 'LAW ENFORCEMENT FACILITY',
    'DEAD ON ARRIVAL AT MEDICAL FACILITY': 'EN ROUTE TO MEDICAL FACILITY',
    'EN ROUTE TO BOOKING CENTER/POLICE LOCKUP': 'EN ROUTE TO LAW ENFORCEMENT FACILITY',
    'ELSEWHERE': OTHER_SPECIFY,
    'ELSEWHERE, SPECIFY': OTHER_SPECIFY,
}
cdr['death_location'] = cdr['death_location'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr.death_location, cdr.form_version)
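
The same strip-and-replace pattern recurs for most columns in this section. As a sketch, it could be factored into one helper; `apply_replacements` is a hypothetical name that does not exist in this notebook or in `lib.cleaning_tools` as far as the code above shows, and the sample values are invented.

```python
import pandas as pd

def apply_replacements(series, replacements):
    """Map each non-null value through `replacements` (matching on the stripped
    string); values without a mapping are returned unchanged, nulls become None."""
    return series.apply(
        lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))

toy = pd.Series(['AT MEDICAL FACILITY ', 'ELSEWHERE', 'SOME UNMAPPED VALUE', None])
mapping = {
    'AT MEDICAL FACILITY': 'MEDICAL FACILITY',
    'ELSEWHERE': 'OTHER, SPECIFY',
}
cleaned = apply_replacements(toy, mapping)
print(cleaned.tolist())
```

Like the inline lambdas, this only strips whitespace for the dictionary lookup; a value with no mapping keeps any stray whitespace it had.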

### Fix: `means_of_death`

In [None]:
pd.crosstab(cdr.means_of_death, cdr.form_version)

In [None]:
replacements = {
    'NOT APPLICABLE, CAUSE OF DEATH WAS ILLNESS/NATURAL CAUSE': 'NOT APPLICABLE',
    'NOT APPLICABLE; CAUSE OF DEATH WAS INTOXICATION OR ILLNESS/NATURAL CAUSES': 'NOT APPLICABLE',
    'OTHER': OTHER_SPECIFY,
    'KNIFE, CUTTING INSTRUMENT': 'KNIFE / EDGED INSTRUMENT',
    'BLUNT INSTRUMENT': 'BATON / BLUNT INSTRUMENT',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'RIFLE/SHOTGUN': 'FIREARM',
}
cdr['means_of_death'] = cdr['means_of_death'].apply(lambda x:  None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr.means_of_death, cdr.form_version)

### It seems officers sometimes misuse the 'other' option, selecting it when another category is more appropriate:

In [None]:
cdr[cdr.means_of_death == OTHER_SPECIFY]['means_of_death_other'].value_counts().head()

### Let's roll all rare and 'unknown'-type values into the 'other' category. 'VEHICLE ACCIDENT' in particular is new in the 2016 form, only indicated as 'other' in older forms.

In [None]:
other_values = ['UNKNOWN', 'VEHICLE ACCIDENT', 'KNIFE / EDGED INSTRUMENT', 'BATON / BLUNT INSTRUMENT']
indices = cdr['means_of_death'].isin(other_values)
cdr.loc[indices, 'means_of_death_other'] = cdr.loc[indices, 'means_of_death']
cdr.loc[indices, 'means_of_death'] = OTHER_SPECIFY

pd.crosstab(cdr.means_of_death, cdr.form_version)
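
A toy illustration of the roll-up, with invented rows: the specific value is first copied into the free-text column and only then is the main column overwritten, so no information is lost.

```python
import pandas as pd

OTHER_SPECIFY = 'OTHER, SPECIFY'
toy = pd.DataFrame({
    'means_of_death': ['FIREARM', 'VEHICLE ACCIDENT', 'UNKNOWN'],
    'means_of_death_other': [None, None, None],
})

rare = ['UNKNOWN', 'VEHICLE ACCIDENT']
mask = toy['means_of_death'].isin(rare)
# Order matters: preserve the original value before replacing it.
toy.loc[mask, 'means_of_death_other'] = toy.loc[mask, 'means_of_death']
toy.loc[mask, 'means_of_death'] = OTHER_SPECIFY
print(toy)
```

One caveat worth keeping in mind: if a rolled-up row already had free text in `means_of_death_other`, this assignment overwrites it.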

### Fix: `manner_of_death` and  `manner_of_death_description`

In [None]:
pd.crosstab(cdr.form_version, cdr.manner_of_death).T

In [None]:
pd.crosstab(cdr.manner_of_death, cdr.manner_of_death_description.notnull())

In [None]:
replacements = {
    'NATURAL': 'NATURAL CAUSES/ILLNESS',
    'JUSTIFIABLE HOMICIDE': 'HOMICIDE',
    'HOMICIDE BY LAW ENFORCEMENT/CORRECTIONAL STAFF': 'HOMICIDE',
    'OTHER HOMICIDE': 'HOMICIDE',
    'HOMICIDE (INCLUDES JUSTIFIABLE HOMICIDE)': 'HOMICIDE',
    'ACCIDENTAL INJURY CAUSED BY OTHERS': 'ACCIDENTAL',
    'ACCIDENTAL INJURY TO SELF': 'ACCIDENTAL',
    'OTHER': OTHER_SPECIFY,
    'OTHER - SPECIFY': OTHER_SPECIFY,
}
cdr['manner_of_death'] = cdr['manner_of_death'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))

# In past versions, "pending autopsy results" was not an option, and reports had "other"
# checked with some mention of pending autopsy in the free field. We emulate this here
# to preserve consistency across form versions.
other_values = ['PENDING AUTOPSY RESULTS', 'COULD NOT BE DETERMINED']
indices = cdr['manner_of_death'].isin(other_values)
cdr.loc[indices, 'manner_of_death_description'] = cdr.loc[indices, 'manner_of_death']
cdr.loc[indices, 'manner_of_death'] = OTHER_SPECIFY

pd.crosstab(cdr.form_version, cdr.manner_of_death).T

### A few suicides by hanging have the wrong `means_of_death`

In [None]:
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death != 'HANGING, STRANGULATION')]
frame = frame[(frame.medical_cause_of_death.fillna('').str.contains('HANGING')) |
              frame.manner_of_death_description.fillna('').str.contains('HANGING')]
print(len(frame))
frame[['medical_cause_of_death', 'means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description']].head()

In [None]:
cdr.loc[frame.index, 'means_of_death'] = 'HANGING, STRANGULATION'

# Be sure we got them all
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death != 'HANGING, STRANGULATION')]
frame = frame[frame.medical_cause_of_death.fillna('').str.contains('HANGING')]
assert len(frame) == 0

### For a few suicides, the `means_of_death` is 'NOT APPLICABLE', which makes no sense. We need to change these to OTHER, though (as shown below) more is going on

In [None]:
frame = cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death == 'NOT APPLICABLE')]
print(len(frame))
frame[['means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description', 'medical_cause_of_death']].head()

In [None]:
cdr.loc[frame.index, 'means_of_death'] = OTHER_SPECIFY
assert len(cdr[(cdr.manner_of_death == 'SUICIDE') & (cdr.means_of_death == 'NOT APPLICABLE')]) == 0

### Fix: `pre_existing_medical_condition` 

In [None]:
pd.crosstab(cdr.form_version, cdr.pre_existing_medical_condition).T

In [None]:
replacements = {
    'DECEASED DEVELOPED CONDITION AFTER ADMISSION': 'DEVELOPED CONDITION AFTER ADMISSION',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'NOT APPLICABLE; CAUSE OF DEATH WAS ACCIDENTAL INJURY, INTOXICATION, SUICIDE OR HOMICIDE': 'NOT APPLICABLE',
    'COULD NOT BE DETERMINED': 'UNKNOWN',
    'PRE-EXISTING MEDICAL CONDITION': 'YES',
}
cdr['pre_existing_medical_condition'] = cdr['pre_existing_medical_condition'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr.form_version, cdr.pre_existing_medical_condition).T

### Fix: `who_caused_the_death` 

NOTE: This question is framed as follows:
* 2005 form: "If the death was an accident or homicide, who caused the death?"
* 2016 form: "If the death was an accident, homicide **or suicide**, who caused the death?" (emphasis added)

Thus, we need to:
1. Collapse near-identical values from different forms, similar to the other areas here.
1. Remove suicides from the 2016 responses, as they skew the data (see below). While we're at it, change ANY entries whose manner of death is not homicide, accidental, or other to have "NOT APPLICABLE" as the value.

#### Collapse near-identical values from different forms, similar to the other areas here.

In [None]:
pd.crosstab(cdr.who_caused_the_death, cdr.death_date_and_time.dt.year)

In [None]:
replacements = {
    'DECEASED': 'DECEDENT',
    "DON'T KNOW": 'UNKNOWN',
    "DON\\'T KNOW": 'UNKNOWN',
    'LAW ENFORCEMENT/CORRECTIONAL STAFF': 'LAW ENFORCEMENT/CORRECTIONAL PERSONNEL',
    'NOT APPLICABLE; CAUSE OF DEATH WAS SUICIDE, INTOXICATION OR ILLNESS/NATURAL CAUSES': 'NOT APPLICABLE',
    'OTHER DETAINEES': 'OTHER DETAINEE(S)',
    'OTHER PERSONS': 'OTHER CIVILIAN(S)',
    'ACCIDENTAL INJURY TO SELF': 'ACCIDENTAL',
    'UNKNOWN PERSON(S) CAUSED THE INJURY': 'UNKNOWN',
    'UNKNOWN WHETHER DECEDENT SUSTAINED A FATAL INJURY': 'UNKNOWN',
}
cdr['who_caused_the_death'] = cdr['who_caused_the_death'].apply(lambda x:  None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['who_caused_the_death'], cdr.form_version)

#### Remove suicides from the 2016 responses to 'who caused the death' question, as in previous years 'who caused the death' was only asked for homicides and accidents. (see below)

In [None]:
pd.crosstab(cdr.who_caused_the_death, cdr.manner_of_death)

In [None]:
cdr.loc[~cdr.manner_of_death.isin(['HOMICIDE', 'ACCIDENTAL', OTHER_SPECIFY]), 'who_caused_the_death'] = 'NOT APPLICABLE'
pd.crosstab(cdr.who_caused_the_death, cdr.manner_of_death)
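
The effect of this reset on individual rows is easiest to see on a small invented frame:

```python
import pandas as pd

OTHER_SPECIFY = 'OTHER, SPECIFY'
toy = pd.DataFrame({
    'manner_of_death': ['HOMICIDE', 'SUICIDE', 'NATURAL CAUSES/ILLNESS', 'ACCIDENTAL'],
    'who_caused_the_death': ['LAW ENFORCEMENT/CORRECTIONAL PERSONNEL', 'DECEDENT', None, 'DECEDENT'],
})

# Only these manners of death keep their "who caused the death" answer.
keep = toy['manner_of_death'].isin(['HOMICIDE', 'ACCIDENTAL', OTHER_SPECIFY])
toy.loc[~keep, 'who_caused_the_death'] = 'NOT APPLICABLE'
print(toy)
```

Because `isin` returns False for missing values, records with no `manner_of_death` at all are also set to 'NOT APPLICABLE' by this rule.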

### Fix `medical_examinor_coroner_evalution`

In [None]:
pd.crosstab(cdr.form_version, cdr.medical_examinor_coroner_evalution).T

In [None]:
def fix_coroner(result):
    if pd.isnull(result):
        return None
    result = result.strip()
    if result.startswith('YES'):
        return 'YES'
    elif result.startswith('NO'):
        return 'NO'
    return None

cdr['medical_examinor_coroner_evalution'] = cdr['medical_examinor_coroner_evalution'].apply(fix_coroner)
pd.crosstab(cdr.form_version, cdr.medical_examinor_coroner_evalution).T

<div style="background-color: #008;height: 15px"></div>

## 3d.  Fix other columns one-by-one

### Fix `gender` related columns

In [None]:
standardize_gender_cols(cdr)

### Fix: `were_the_charges` 

In [None]:
pd.crosstab(cdr['were_the_charges'], cdr.form_version)

In [None]:
replacements = {
    'CAPITAL MURDER': 'CONVICTED',
    'PROBATION/PAROLE': 'PROBATION/PAROLE VIOLATION',
    'A PROBATION/PAROLE VIOLATION': 'PROBATION/PAROLE VIOLATION',
}
cdr['were_the_charges'] = cdr['were_the_charges'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['were_the_charges'], cdr.form_version)

### Fix `type_of_custody`

In [None]:
pd.crosstab(cdr['type_of_custody'], cdr.form_version)

In [None]:
replacements = {
    'PRE-CUSTODIAL USE OF FORCE': 'POLICE CUSTODY (PRE-BOOKING)',
    'PRIVATE CORRECTIONAL FACILITY': 'PRIVATE FACILITY',
    'COUNTY JAIL': 'JAIL - COUNTY',
    'MUNICIPAL JAIL': 'JAIL - MUNICIPAL',
    'PENITENTIARY': 'PRISON',
}
cdr['type_of_custody'] = cdr['type_of_custody'].apply(lambda x: None if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['type_of_custody'], cdr.form_version)

### Fix `specific_type_of_custody_facility`

In [None]:
pd.crosstab(cdr.specific_type_of_custody_facility, cdr.form_version)

In [None]:
replacements = {
    'TDCJ, SPECIFY': 'TDCJ',
    'CUSTODY OF PEACE OFFICER DURING/FLEEING ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FLEEING ARREST',
    'CUSTODY OF PEACE OFFICER SUBSEQUENT TO ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST',
    'CUSTODY OF LAW ENFORCEMENT PERSONNEL SUBSEQUENT TO ARREST': 'CUSTODY OF LAW ENFORCEMENT PERSONNEL AFTER ARREST',
    'TEXAS-JUVENILE JUSTICE DEPARTMENT - FACILITY/DETENTION CENTER, SPECIFY': 'OTHER',
    'TJPC': 'OTHER',
    'TYC': 'OTHER',
    'HALFWAY HOUSE/RESTITUTION CENTER': 'OTHER',
    'CORRECTIONAL/REHABILITATION FACILITY': 'OTHER',
    'NON-LAW ENFORCEMENT DETOX FACILITY': 'OTHER',
}
cdr['specific_type_of_custody_facility'] = cdr['specific_type_of_custody_facility'].apply(
    lambda x: x if pd.isnull(x) else replacements.get(x.strip(), x))
pd.crosstab(cdr['specific_type_of_custody_facility'], cdr.form_version)

### Glance at the types of custody crossed with facility

In [None]:
pd.crosstab(cdr.specific_type_of_custody_facility, cdr.type_of_custody)

### `other_behavior` and `specify_other_behavior` are very annoying and misused

In [None]:
s = cdr['other_behavior'].value_counts()
print("%d records have 'other_behavior' set (%d do not)" % (s.sum(), len(cdr) - s.sum()))
# value_counts() has one row per unique value, so len(s) is the number of unique values
print("%d unique values for 'other_behavior' - sample below:" % len(s))
s.head(10)

In [None]:
s = cdr['specify_other_behavior'].value_counts()
print("%d records have 'specify_other_behavior' set (%d do not)" % (s.sum(), len(cdr) - s.sum()))
# value_counts() has one row per unique value, so len(s) is the number of unique values
print("%d unique values for 'specify_other_behavior' - sample below:" % len(s))
s.head(10)

### It _appears_ that anytime other_behavior is 1.0, specify_other_behavior is set:

In [None]:
cdr[cdr['specify_other_behavior'].notnull()]['other_behavior'].fillna('(None)').value_counts()

### Alright, we can just overwrite 'other_behavior' with 'specify_other_behavior' when the latter is given

In [None]:
behavior = []
for other, specify in zip(cdr['other_behavior'], cdr['specify_other_behavior']):
    if pd.notnull(specify):
        behavior.append(specify)
    else:
        try:
            other = float(other)
            if other == 0.0:
                behavior.append(None)
                continue
        except TypeError:
            pass
        except ValueError:
            pass
        behavior.append(other)

cdr['other_behavior'] = pd.Series(behavior, index=cdr.index)
cdr.drop('specify_other_behavior', axis=1, inplace=True)

cdr['other_behavior'].fillna('(None)').value_counts().head(10)
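
The merge rule is compact enough to state as a single function. This is a sketch on invented values; `merge_behavior` is a hypothetical name, and unlike the loop above it returns numeric-looking values unchanged rather than converted to float.

```python
import pandas as pd

def merge_behavior(other, specify):
    """Prefer the free-text `specify` value; treat a 0 flag as 'no behavior'."""
    if pd.notnull(specify):
        return specify
    try:
        if float(other) == 0.0:
            return None
    except (TypeError, ValueError):
        pass  # `other` is None or non-numeric text; keep it as-is
    return other

# Invented sample values for the two columns.
others = [0.0, 1.0, 'AGGRESSIVE', None]
specifies = [None, 'INTOXICATED', None, None]
merged = [merge_behavior(o, s) for o, s in zip(others, specifies)]
print(merged)
```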

<div style="background-color: #008;height: 15px"></div>

## 3e. Identify and drop a range of unnecessary columns

### `entry_date_time_n_a` and `custody_date_na` are useless - we can infer them from (duh) an NA value in entry_date_time or custody_date

In [None]:
cdr.groupby([cdr.entry_date_time.isnull(), cdr.entry_date_time_n_a]).size().unstack()

In [None]:
pd.crosstab(cdr.custody_date_na, cdr['date_time_of_custody_or_incident'].isnull())

In [None]:
cdr.drop(['entry_date_time_n_a', 'custody_date_na'], axis=1, inplace=True)

### 'department_type' is not consistently present. Drop it.

In [None]:
pd.crosstab(cdr.department_type.notnull(), cdr.form_version)

In [None]:
cdr.drop('department_type', axis=1, inplace=True)

<div style="background-color: #080;height: 20px"></div>

# 4. De-duplicate (oh boy, this is a doozy)

### In theory we should be able to de-dup by the record id (`cdr_cdr_name`), BUT...
1. Sometimes two records with the same record ID differ on some columns. _Usually_ the version_number can tell us which one is the latest, but not always.
1. Sometimes we'll see multiple records for the same person with different record IDs (ugh).

...so it's complicated.

### After a LOT of tinkering, I think we can catch most reasonable duplicates with the following:
  * Drop all records that are a complete, exact duplicate of another (easy step)
  * Merge all records sharing a record id, `cdr_cdr_name`
  * Merge all records with the same name and date of birth
  * Merge all records with the same name and date of death
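
To see why more than one matching rule is needed, here is a minimal sketch on invented toy data (the rows are made up; only the column names come from this notebook). It shows that `duplicated(..., keep=False)` flags every member of a duplicate group, and that two records with different record IDs are only caught by the name + date of birth rule.

```python
import pandas as pd

# Toy data, invented for illustration; column names mirror this notebook.
toy = pd.DataFrame({
    'cdr_cdr_name': ['A-1', 'A-1', 'B-2', 'C-3'],
    'first_name': ['ANN', 'ANN', 'BOB', 'BOB'],
    'last_name': ['LEE', 'LEE', 'RAY', 'RAY'],
    'date_of_birth': pd.to_datetime(
        ['1970-01-01', '1970-01-01', '1980-05-05', '1980-05-05']),
})

# keep=False marks ALL rows in a duplicate group, not just the repeats.
by_id = toy.duplicated(subset=['cdr_cdr_name'], keep=False)
by_name_dob = toy.duplicated(
    subset=['first_name', 'last_name', 'date_of_birth'], keep=False)

print(by_id.tolist())        # rows sharing a record id
print(by_name_dob.tolist())  # rows sharing name and date of birth
```

The two BOB RAY rows have different record IDs, so the first rule misses them and the second rule picks them up.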

### 4a. Get rid of totally, utterly duplicate rows

In [None]:
pure_dups = cdr.duplicated()
cdr = cdr[~pure_dups]
print("Dropping %d rows that are 100%% duplicates of another row, leaving %d rows" % (pure_dups.sum(), len(cdr)))

### 4b. Define some dedup functions to help us

In [None]:
def merge_dup_records(frame):
    '''Master merge function. Creates one record from several that are known duplicates.'''
    # Ignore BJS records (these are from an old data dump),
    # unless there is no other option.
    form_versions_seen = set(frame['form_version'])
    if 'V_BJS' in form_versions_seen and len(form_versions_seen) > 1:
        frame = frame[frame['form_version'] != 'V_BJS']
        if len(frame) == 1:
            return frame.iloc[0], 'Keeping the only non-BJS record'
    
    
    # If one record has a higher version_number than the rest, keep that one.
    # If one record has a more recent report_date than the rest, keep that one.
    max_cols = ['version_number', 'report_date']
    for c in max_cols:
        maxval = frame[c].max()  # Implicitly ignores missing values, unless only missing values exist
        if pd.notnull(maxval):
            frame = frame[frame[c] == maxval]
            if len(frame) == 1:
                return frame.iloc[0], 'Keeping the record with greatest %s' % c

    # Otherwise, there's no way to flag the "one" right record (that I know of).
    # So we gotta merge them somehow...
    merged_rec = pd.Series(index = frame.columns,
                           name = 1000000 + frame.index[0])  # Give it a new, unique index
    awk = False
    for c in frame.columns:
        notnull = frame[c][frame[c].notnull()]

        # If all records have NA for this column, leave it as NA
        if len(notnull) == 0:
            merged_rec[c] = frame[c].iloc[0]
            continue

        # Only 1 unique not-null value? Keep that one.
        if len(notnull) == 1 or len(set(notnull)) == 1:
            merged_rec[c] = notnull.iloc[0]
            continue
        
        # Are we trying to merge record IDs? That's impossible anyway,
        # let's just concatenate them.
        if c == 'cdr_cdr_name':
            merged_rec[c] = '-'.join(notnull)
            continue
        
        # Well, poop. Multiple unique values for this column.
        # Take the most popular one ¯\_(ツ)_/¯
        # (Which will just be a random one if there's a tie ¯\_(ツ)_/¯ )
        awk = True
        vc = notnull.value_counts()
        keeper = vc.index[0]
        if vc.iloc[0] > vc.iloc[1]:
            print("  > Problem with column %s, keeping the most popular value, '%s'"
                  % (c, keeper), notnull.values)
        else:
            print("  > Problem with column %s, keeping an arbitrary tied-for-most-popular value, '%s'"
                  % (c, keeper), notnull.values)
        merged_rec[c] = keeper

    merged_rec['cdr_cdr_name'] = 'MERGED-DUPLICATES-%s' % merged_rec['cdr_cdr_name']
    if awk:
        return merged_rec, 'Merged awkwardly'
    else:
        return merged_rec, 'Merged smoothly enough'
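
The "most popular value" fallback in the function above relies on `value_counts()` sorting by descending count. A minimal sketch on invented values (not real CDR data):

```python
import pandas as pd

# Three conflicting values for one column across duplicate records (made up).
values = pd.Series(['HOSPITAL', 'JAIL', 'HOSPITAL'])

counts = values.value_counts()   # sorted by count, most frequent first
keeper = counts.index[0]         # the most frequent value
has_clear_winner = counts.iloc[0] > counts.iloc[1]

print(keeper, has_clear_winner)
```

When the top two counts are equal, which value lands in `counts.index[0]` depends on how pandas orders ties, so the result in that case is best treated as arbitrary.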

In [None]:
def dedup_cdr_by(cdr, cols):
    '''Given a cdr dataframe, and a set of columns to use to identify duplicates, dedups/merges as needed.'''
    dups = cdr[cdr.duplicated(subset=cols, keep=False)]
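    if not len(dups):
        # Nothing to merge. Return the same 4-tuple shape as the normal path,
        # since the caller unpacks four values.
        return cdr, [], [], []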
    unmerged_frames = []
    merged_records = []
    merge_methods = []
    for _, frame in dups.groupby(cols):
        rec, meth = merge_dup_records(frame)
        unmerged_frames.append(frame)
        merged_records.append(rec)
        merge_methods.append(meth)
        if 'awkward' in meth:
            print("...awkward merge complete for records at indices", frame.index)

    return cdr.drop(dups.index).append(merged_records), unmerged_frames, merged_records, merge_methods

### 4c. Dedup!

In [None]:
dedup_rounds = [
    ['cdr_cdr_name'],
    ['first_name', 'last_name', 'date_of_birth'],
    ['first_name', 'last_name', 'death_date'],
]
all_cdrs = [cdr]
all_merged_records = []
all_unmerged_frames = []
all_merge_methods = []
for i, dr in enumerate(dedup_rounds):
    print('**** Dedup step %d: find duplicates on these columns:' % (i + 1), dr)
    vals = dedup_cdr_by(all_cdrs[-1], dr)
    new_cdr, umf, mr, mm = vals
    all_cdrs.append(new_cdr)
    all_unmerged_frames.append(umf)
    all_merged_records.append(mr)
    all_merge_methods.append(mm)
    dropping = sum(len(f) for f in umf)
    print("Dropping %d duplicates and adding %d merged records, yielding %d records" % (
        dropping, len(mr), len(new_cdr)))

In [None]:
print("Ultimately removing %d duplicate records, leaving %d" % (
    len(all_cdrs[0]) - len(all_cdrs[-1]), len(all_cdrs[-1])))
cdr = all_cdrs[-1]

### 4d. If you want to inspect the merges (what records were merged into what), use the code below

In [None]:
for rd, cols in enumerate(dedup_rounds):
    print("In round %d, there were %d record merges based on" % (rd, len(all_merged_records[rd])), cols)

In [None]:
def show_merge(merge_round, idx):
    umf = all_unmerged_frames[merge_round][idx]
    mr = all_merged_records[merge_round][idx]
    renamed = mr.copy()
    renamed.name = 'merged'
    return umf.append(renamed)

In [None]:
show_merge(2, 1)

<div style="background-color: #080;height: 20px"></div>

# 5. Add new columns, rename/reorder, otherwise finalize dataframe

<div style="background-color: #008;height: 15px"></div>

## 5a. New columns and bookkeeping columns

### Column for the time (in days) between incarceration (or incident) and death
Note: for a few of these, the death date/time is before the incarceration/incident date. If death precedes custody by a day or less, we call it 0 and assume it was a minor data-entry error. If the gap is more than a day, clearly there was a larger error, so we use a NaN value.
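
The thresholds in this rule follow from how pandas represents negative time differences: a negative `Timedelta` is stored as a negative whole number of days plus a positive remainder, so any gap of up to 24 hours in the wrong direction has `.days == -1`. A small sketch with invented timestamps:

```python
import pandas as pd

custody = pd.Timestamp('2010-03-02 01:00')
death_slightly_before = pd.Timestamp('2010-03-01 22:00')  # 3 hours earlier
death_well_before = pd.Timestamp('2010-02-27 22:00')      # 2 days, 3 hours earlier

small_gap = death_slightly_before - custody
large_gap = death_well_before - custody

print(small_gap, small_gap.days)  # treated as 0 days by the rule above
print(large_gap, large_gap.days)  # treated as NaN by the rule above
```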

In [None]:
cdr.date_time_of_custody_or_incident.dtype

In [None]:
def get_days(dt):
    if dt.days < -1:
        return None
    elif dt.days == -1:
        return 0
    else:
        return dt.days

delta = cdr.death_date_and_time - cdr.date_time_of_custody_or_incident
print("For %d records with death date before custody date, setting the days_from_custody_to_death to NaN" % (delta.dt.days < -1).sum())
cdr['days_from_custody_to_death'] = delta.apply(get_days)

In [None]:
cdr['suffix']

In [None]:
cdr['name_full'] = ''
for col in ['first_name', 'middle_name', 'last_name']:
    cdr['name_full'] = cdr['name_full'] + ' ' + cdr[col].fillna('')
cdr['name_full'] = cdr['name_full'].apply(lambda s: ' '.join(s.strip().split()))
cdr.loc[cdr['name_full'] == '', 'name_full'] = np.nan

### Tweak bookkeeping columns

In [None]:
cdr['num_revisions'] = cdr['version_number'] - 1
cdr.drop(['version_type', 'version_number'], axis=1, inplace=True)

<div style="background-color: #008;height: 15px"></div>

## 5b. Rename and reorder columns sensibly

In [None]:
col_renames = {
    'first_name': 'name_first',
    'middle_name': 'name_middle',
    'last_name': 'name_last',
    'suffix': 'name_suffix',
    'cdr_cdr_name': 'record_id',
    'death_causer_other': 'who_caused_death_in_homicide_or_accident_other',
    'who_caused_the_death': 'who_caused_death_in_homicide_or_accident',
    'death_location': 'death_location_type',
    'death_location_elsewhere': 'death_location_type_other',
    'city': 'death_location_city',
    'county': 'death_location_county',
    'street_address': 'death_location_street_address',
    'entry_date_time': 'facility_entry_date_time',
    'pre_existing_medical_condition': 'death_from_pre_existing_medical_condition',
}

In [None]:
cdr.rename(columns=col_renames, inplace=True)

In [None]:
new_order = [
    # Record indexing columns
    'record_id',
    'num_revisions',
    'form_version',
    'report_date',
    'date_time_of_custody_or_incident',

    # Deceased personal information, demographics
    'name_first',
    'name_last',
    'name_middle',
    'name_suffix',
    'name_full',
    'date_of_birth',
    'age_at_time_of_death',
    'sex',
    'race',

    # Death event information
    'death_date',
    'death_date_and_time',
    'death_location_county',
    'death_location_city',
    'death_location_street_address',
    'death_location_type',
    'death_location_type_other',
    'death_from_pre_existing_medical_condition',
    'manner_of_death',
    'manner_of_death_description',
    'means_of_death',
    'means_of_death_other',
    'medical_cause_of_death',
    'medical_examinor_coroner_evalution',
    'medical_treatment',
    'days_from_custody_to_death',
    'who_caused_death_in_homicide_or_accident',
    'who_caused_death_in_homicide_or_accident_other',

    # Criminal information on deceased
    'offense_1',
    'offense_2',
    'offense_3',
    'type_of_offense',
    'type_of_offense_other',
    'were_the_charges',

    # Facility and agency information
    'facility_entry_date_time',
    'type_of_custody',
    'specific_type_of_custody_facility',
    'agency_address',
    'agency_city',
    'agency_county',
    'agency_name',
    'agency_zip',
    
    # Deceased behavior upon entry or custody
    'type_of_restraint',
    'under_restraint',
    'entry_behavior',
    'other_behavior',
    'exhibit_any_medical_problems',
    'exhibit_any_mental_health_problems',
    'make_suicidal_statements',
]

cdr = reorder_columns_and_check(cdr, new_order)

<div style="background-color: #008;height: 15px"></div>

## 5c. Final look before writing (NO CHANGES from here until writing)

In [None]:
cdr.sample(10)

In [None]:
show_notnull_chart(cdr)

In [None]:
print(len(cdr))

In [None]:
cdr.form_version.value_counts()

In [None]:
pd.crosstab(cdr.death_date.apply(lambda dt: dt.year), cdr.form_version)

### One more duplication sanity check -- any CDRs with the same name since 2018?

In [None]:
f = cdr[cdr.death_date.apply(lambda dt: dt.year) >= 2018]
vc = f['name_full'].value_counts()
vc[vc > 1]

### Looking at these cases, they seem fine. Just name coincidences.

In [None]:
f[f['name_full'] == 'RENE GARCIA']

In [None]:
f[f['name_full'] == 'JOSE RIOS']

<div style="background-color: #080;height: 20px"></div>

# 6. Write

In [None]:
if os.environ.get('CLEAN_CDR_DW') != 'TRUE':
    print("Not syncing to Data.world. To do so, set CLEAN_CDR_DW to 'TRUE'")
else:
    with dw.open_remote_file(OUTPUT_DW_PROJECT, OUTPUT_FILENAME) as w:
        print("Writing to data.world:", OUTPUT_FILENAME)
        cdr.to_csv(w, index=False)
        

In [None]:
if os.environ.get('CLEAN_CDR_S3') != 'TRUE':
    print("Not writing to s3. To do so, set CLEAN_CDR_S3 to 'TRUE'")
else:
    csv_buffer = StringIO()
    cdr.to_csv(csv_buffer, index=False)
    s3_resource = boto3.resource('s3')
    s3_resource.Object('tji-public-cleaned-datasets', OUTPUT_FILENAME).put(Body=csv_buffer.getvalue())


<div style="background-color: #A0A;height: 20px"></div>

# 7. TODOs -- problems that still exist in the data and need work

### [TODO] `age_at_time_of_death` and `date_of_birth` can sometimes be impossible

In [None]:
cdr.age_at_time_of_death[cdr.age_at_time_of_death < 0].value_counts().sort_index()

In [None]:
print("%d records have a birth date AFTER death date" % (cdr.date_of_birth > cdr.death_date).sum())
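
One possible starting point for this TODO, sketched on invented rows: recompute the age from `date_of_birth` and `death_date`, then flag records where either the reported or the recomputed age is negative. The `computed_age` and `suspect` names are hypothetical, and this is only an assumption about how the check could work, not a decision about how to repair the records.

```python
import pandas as pd

# Invented rows; column names mirror this notebook.
toy = pd.DataFrame({
    'date_of_birth': pd.to_datetime(['1950-06-15', '2012-01-01', None]),
    'death_date': pd.to_datetime(['2010-06-14', '2011-01-01', '2015-03-03']),
    'age_at_time_of_death': [59, -1, 40],
})

dob, dod = toy['date_of_birth'], toy['death_date']

# True when the death falls earlier in the calendar year than the birthday.
before_birthday = (dod.dt.month < dob.dt.month) | (
    (dod.dt.month == dob.dt.month) & (dod.dt.day < dob.dt.day))

# Missing birth dates propagate as NaN rather than raising.
computed_age = dod.dt.year - dob.dt.year - before_birthday.astype(int)

# Flag impossible values; comparisons against NaN are simply False.
suspect = (computed_age < 0) | (toy['age_at_time_of_death'] < 0)

print(pd.DataFrame({'computed_age': computed_age, 'suspect': suspect}))
```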

### [TODO] `type_of_offense` is a semicolon-separated list. Break this into multiple binary columns? Or reduce to only the most serious type? Unclear what to do if we want to actually USE this column

In [None]:
s = cdr.type_of_offense
print("%d unique values for type_of_offense. Sample below:" % len(set(cdr.type_of_offense)))
s.value_counts().head(25)

In [None]:
s = cdr.type_of_offense_other
print("%d unique values for type_of_offense_other. Sample below:" % len(set(cdr.type_of_offense_other)))
s.value_counts().head(10)
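
If the "multiple binary columns" option is chosen, one possible approach is `Series.str.get_dummies`, sketched here on invented values. The offense labels and the `offense_type_` prefix are assumptions for illustration, as is the guess that whitespace around the semicolons may vary.

```python
import pandas as pd

# Invented values; real type_of_offense strings may look different.
offenses = pd.Series(
    ['VIOLENT; DRUG', 'DRUG', None, 'PROPERTY;VIOLENT'],
    name='type_of_offense')

# Normalize spacing around the separator so ' DRUG' and 'DRUG' don't
# become two different columns.
normalized = offenses.str.replace(r'\s*;\s*', ';', regex=True)

# One 0/1 column per distinct offense type; missing values get all zeros.
flags = normalized.str.get_dummies(sep=';').add_prefix('offense_type_')

print(flags)
```

This keeps every listed type, at the cost of adding one column per distinct value, so it is probably only practical after the free-text variants have been consolidated.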

### [TODO] Manner of death is sometimes 'OTHER' when it should not be
Two columns detail the manner of death: `manner_of_death` and `manner_of_death_description`. A handful of rows have `manner_of_death` = `OTHER` when they should have picked `ACCIDENT` or `HOMICIDE` -- as evinced by the information in the `manner_of_death_description` column:

In [None]:
frame = cdr[(cdr.manner_of_death == OTHER_SPECIFY) & (cdr.manner_of_death_description.notnull())]
for exempt in ('PENDING', 'DETERMIN'):
    frame = frame[~frame.manner_of_death_description.str.contains(exempt)]
print(len(frame), len(cdr))
frame[['form_version', 'manner_of_death_description', 'manner_of_death', 'medical_cause_of_death']]
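
A rough sketch of how the description text could be used to propose a corrected manner of death, on invented rows. The `'OTHER'` literal stands in for whatever value `OTHER_SPECIFY` holds, the keyword list is an assumption, and `suggested_manner` is a hypothetical column meant for manual review rather than an automatic overwrite.

```python
import pandas as pd

# Invented rows; 'OTHER' is a placeholder for the notebook's OTHER_SPECIFY value.
toy = pd.DataFrame({
    'manner_of_death': ['OTHER', 'OTHER', 'OTHER', 'SUICIDE'],
    'manner_of_death_description': [
        'MOTOR VEHICLE ACCIDENT', 'HOMICIDE BY GUNSHOT', 'PENDING AUTOPSY', None],
})

desc = toy['manner_of_death_description'].fillna('')
is_other = toy['manner_of_death'] == 'OTHER'

# Start with no suggestion, then fill in where a keyword appears.
toy['suggested_manner'] = pd.Series([None] * len(toy), index=toy.index, dtype=object)
for keyword in ('ACCIDENT', 'HOMICIDE'):  # assumed keywords, plain substring match
    hit = is_other & desc.str.contains(keyword, regex=False)
    toy.loc[hit, 'suggested_manner'] = keyword

print(toy)
```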

### [TODO] (Related) Extract motor vehicle accidents into a unique type?

In [None]:
# This starter code should get it going
frame = cdr[(cdr.manner_of_death != 'ACCIDENTAL')]
frame = frame[(frame.medical_cause_of_death.fillna('').str.contains('ACCIDENT')) |
              frame.manner_of_death_description.fillna('').str.contains('ACCIDENT')]
print(len(frame))
frame[['medical_cause_of_death', 'means_of_death', 'means_of_death_other', 'manner_of_death', 'manner_of_death_description']]

### [TODO] Remove `medical_examinor_coroner_evalution`? Not sure it's useful.

In [None]:
pd.crosstab(cdr.medical_examinor_coroner_evalution, cdr.form_version).sort_values(['V_2005'],ascending=False)

### [TODO] Somehow blend the `entry_behavior`/`other_behavior` field from V_2005 with the `exhibit_any_medical_problems`, `exhibit_any_mental_health_problems` and `make_suicidal_statements` from V_2016

In [None]:
pd.crosstab(cdr.entry_behavior, cdr.form_version).sort_values(['V_2005'],ascending=False)

In [None]:
pd.crosstab(cdr.exhibit_any_medical_problems, cdr.form_version).sort_values(['V_2016'],ascending=False)

In [None]:
pd.crosstab(cdr.exhibit_any_mental_health_problems, cdr.form_version).sort_values(['V_2016'],ascending=False)

In [None]:
pd.crosstab(cdr.make_suicidal_statements, cdr.form_version).sort_values(['V_2016'],ascending=False)

### [TODO] Align various binary behavioral columns across form versions so we can keep them?

In [None]:
# These columns were completely removed. Some can stay once paired off
drop_cols