# Generate a minimalist, compressed version of CDR/OIS data for our website's explore-the-data app

### NOTE: This is a temporary file, only existing until this code lives in a cron job somewhere

### Purpose of this notebook

This notebook generates the data files that our explore-the-data page uses. There are two for each dataset:
1. A compressed JSON file with just enough data to show the charts, e.g. `cdr_compressed.json`
2. The full CSV file, with all columns, with rows in the same order as the compressed file, e.g. `cdr_full.csv`

Before running this notebook, edit the first cell to indicate which dataset you want to process, and where you want the resulting files to end up. If you want to process multiple datasets, you'll have to rerun this notebook several times with different settings in the first cell.

### About the compressed file

Say we have a set of records like this:
```
   [
      {"sex": "M", "race": "WHITE", "record_id": "PA111"},
      {"sex": "M", "race": "HISPANIC", "record_id": "PA222"},
      {"sex": "F", "race": "BLACK", "record_id": "PA333"},
      {"sex": "F", "race": null, "record_id": "PA444"},
   ]
```

We will compress them to look like this:
```
    {
      "meta": {
        "lookups": {
          "sex": ["F", "M"],
          "race": ["BLACK", "HISPANIC", "WHITE"]
        },
        "record_ids": {
          "field_name": "record_id",
          "values": ["PA111", "PA222", "PA333", "PA444"]
        },
        "num_records": 4,
        "num_columns": 3
      },
      "records": {
        "sex": [1, 1, 0, 0],
        "race": [2, 1, 0, -1]
      }
    }
```
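Each array in `records` holds, for one column, the index of every record's value in that column's lookup array. A missing value is stored as -1. The ID column, if there is one, is stored verbatim under `meta.record_ids` rather than encoded.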
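
To make the encoding concrete, here is a minimal sketch of how a consumer could turn the compressed structure back into row records. The `decompress` function is hypothetical (it is not part of this notebook or of the website code), and the example input is the four records above encoded with sorted lookup lists.

```python
def decompress(js):
    """Rebuild a list of row dicts from the compressed structure (illustrative only)."""
    lookups = js["meta"]["lookups"]
    columns = js["records"]
    n = js["meta"]["num_records"]
    ids = js["meta"].get("record_ids")  # absent when no ID column was given

    rows = []
    for i in range(n):
        row = {}
        for col, indices in columns.items():
            idx = indices[i]
            # -1 marks a missing value; anything else indexes the lookup list
            row[col] = None if idx == -1 else lookups[col][idx]
        if ids:
            row[ids["field_name"]] = ids["values"][i]
        rows.append(row)
    return rows


example = {
    "meta": {
        "lookups": {"sex": ["F", "M"], "race": ["BLACK", "HISPANIC", "WHITE"]},
        "record_ids": {
            "field_name": "record_id",
            "values": ["PA111", "PA222", "PA333", "PA444"],
        },
        "num_records": 4,
        "num_columns": 3,
    },
    "records": {"sex": [1, 1, 0, 0], "race": [2, 1, 0, -1]},
}

rows = decompress(example)
print(rows[0])
print(rows[3])
```

The last record comes back with `race` set to `None`, matching the `null` in the original record.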

In practice, this cuts our data size down dramatically, because each column name is stored once instead of once per record, and long string values are replaced by small integer indices.
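
As a rough illustration of where the savings come from, the sketch below compares the JSON size of row-oriented records against the column-plus-lookup layout. The data is synthetic (1,000 randomly generated rows with two columns), and `encode` is a hypothetical pure-Python stand-in for the notebook's `compress`, so the numbers it prints say nothing about the real CDR or OIS files.

```python
import json
import random

random.seed(0)  # reproducible synthetic data, not real records
SEXES = ["F", "M"]
RACES = ["BLACK", "HISPANIC", "WHITE"]

rows = [
    {"sex": random.choice(SEXES), "race": random.choice(RACES)}
    for _ in range(1000)
]


def encode(rows, columns):
    """Column-wise lookup encoding, mirroring the layout described above."""
    out = {
        "meta": {
            "lookups": {},
            "num_records": len(rows),
            "num_columns": len(columns),
        },
        "records": {},
    }
    for col in columns:
        values = sorted({r[col] for r in rows if r[col] is not None})
        index = {v: i for i, v in enumerate(values)}
        out["meta"]["lookups"][col] = values
        out["records"][col] = [
            -1 if r[col] is None else index[r[col]] for r in rows
        ]
    return out


row_size = len(json.dumps(rows))
compact_size = len(json.dumps(encode(rows, ["sex", "race"])))
print("row-oriented JSON:", row_size, "characters")
print("compressed JSON:  ", compact_size, "characters")
```

The gap grows with the number of records and with the length of the repeated strings, since the row-oriented form pays for every key and value in every record.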

##### Author: Everett Wetchler (everett.wetchler@gmail.com)

## \** CONFIGURE THIS CELL **

In [24]:
OUTFOLDER = './'  # Where to write the resulting files
DATASET = 'cdr'  # Key into CONFIGS object below

## Edit this if you want to tweak what data ends up in the compressed file

In [25]:
CONFIGS = {
    'cdr': {
        'DTW_PROJECT_KEY': 'tji/deaths-in-custody',
        'DTW_FILENAME': 'cleaned_custodial_death_reports',
        'OUTFILE_PREFIX': 'cdr',
        'DATE_COL': 'death_date',
        'ID_COL': 'record_id',
        'KEEP_COLS': [
            'record_id', 'year', 'race', 'sex', 'manner_of_death', 'age_at_time_of_death',
            'type_of_custody', 'death_location_type', 'means_of_death', 'death_location_county', 'agency_name'
        ]
    },
    'ois': {
        'DTW_PROJECT_KEY': 'tji/officer-involved-shootings',
        'DTW_FILENAME': 'shot_civilians',
        'OUTFILE_PREFIX': 'ois',
        'DATE_COL': 'date_incident',
        'ID_COL': None,
        'KEEP_COLS': [
            'year', 'civilian_race', 'civilian_gender', 'civilian_age', 'civilian_died',
            'officer_age_1', 'officer_race_1', 'officer_gender_1', 'incident_result_of',
            'incident_county', 'agency_name_1', 'deadly_weapon',
            'multiple_officers_involved'
        ],
        'RENAMES': {
            'officer_gender_1': 'officer_gender',
            'officer_age_1': 'officer_age',
            'officer_race_1': 'officer_race',
            'agency_name_1': 'agency_name',
        }
    }
}

In [26]:
config = CONFIGS[DATASET]

## Import and go

In [27]:
import datadotworld as dw
import numpy as np
import pandas as pd
import simplejson as json

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

In [28]:
datasets = dw.load_dataset(config['DTW_PROJECT_KEY'], force_update=True)
df = datasets.dataframes[config['DTW_FILENAME']]
df.head()

Unnamed: 0,record_id,num_revisions,form_version,report_date,date_time_of_custody_or_incident,name_first,name_last,name_middle,name_suffix,name_full,date_of_birth,age_at_time_of_death,sex,race,death_date,death_date_and_time,death_location_county,death_location_city,death_location_street_address,death_location_latitude,death_location_longitude,death_location_census_tract,death_location_type,death_location_type_other,death_from_pre_existing_medical_condition,manner_of_death,manner_of_death_description,means_of_death,means_of_death_other,medical_cause_of_death,medical_examinor_coroner_evalution,medical_treatment,days_from_custody_to_death,who_caused_death_in_homicide_or_accident,who_caused_death_in_homicide_or_accident_other,offense_1,offense_2,offense_3,were_the_charges,facility_entry_date_time,type_of_custody,specific_type_of_custody_facility,agency_address,agency_city,agency_county,agency_name,agency_zip,entry_behavior,other_behavior,exhibit_any_medical_problems,exhibit_any_mental_health_problems,make_suicidal_statements
0,16-89-C,0.0,V_2005,2017-03-13 16:47:00,2016-04-26 23:00:00,DARREL,MACHEMEHL,S.,,DARREL S. MACHEMEHL,1972-01-20,44.0,M,WHITE,2016-04-26,2016-04-26 23:00:00,GUADALUPE,KINGSBURY,US 90 AT FM 1104,,,,CRIME/ARREST SCENE,,NOT APPLICABLE,"OTHER, SPECIFY",DREW HIS WEAPON ON OFFICERS AND WAS SHOT BY OF...,FIREARM,,MULTIPLE CENTER FIRE RIFLE WOUNDS,"YES, RESULTS ARE AVAILABLE",NOT APPLICABLE,0.0,LAW ENFORCEMENT/CORRECTIONAL PERSONNEL,,AGGRAVATED ROBBERY,EVADING ARREST WITH VEHICLE,,NOT FILED AT TIME OF DEATH,NaT,POLICE CUSTODY (PRE-BOOKING),CUSTODY OF LAW ENFORCEMENT PERSONNEL DURING/FL...,350 N. GUADALUPE STREET,SEGUIN,GUADALUPE,SEGUIN POLICE DEPT,78155,,WAS SEEN PUTTING A SHOTGUN IN HIS BACKPACK THE...,,,
1,PA10082P,0.0,V_2005,2010-05-11 10:22:00,2006-10-10 00:00:00,JOHN,DEJEAN,,,JOHN DEJEAN,1955-12-03,54.0,M,BLACK,2010-05-10,2010-05-10 06:15:00,GRIMES,NAVASOTA,2400 WALLACE PACK,,,,LAW ENFORCEMENT FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,VENOUS INFARCT OF THE SMALL BOWEL,NOT APPLICABLE,,VENOUS INFARCT OF THE SMALL BOWEL,"YES, RESULTS ARE AVAILABLE",NOT APPLICABLE,1308.0,NOT APPLICABLE,,THEFT,FALSE REPRESENTATION OF A LAWYER,,CONVICTED,2006-10-10 00:00:00,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340,,,,,
2,PA10083P,0.0,V_2005,2010-05-11 10:40:00,2008-03-12 00:00:00,JOEL,LEWIS,THOMAS,,JOEL THOMAS LEWIS,1967-04-17,43.0,M,WHITE,2010-04-27,2010-04-27 18:02:00,JEFFERSON,BEAUMONT,3060 FM 3514,,,,LAW ENFORCEMENT FACILITY,,UNKNOWN,ACCIDENTAL,,NOT APPLICABLE,,HEROIN TOXICITY,"YES, RESULTS ARE AVAILABLE",NOT APPLICABLE,776.0,NOT APPLICABLE,,POSSESSION OF CONTROLLED SUBSTANCE,,,CONVICTED,2009-08-25 00:00:00,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340,,,,,
3,PA10084CJ,0.0,V_2005,2010-05-12 11:09:00,2009-10-13 06:45:00,DENNIS,BRADFORD,EARL,,DENNIS EARL BRADFORD,1969-09-28,40.0,M,WHITE,2010-05-10,2010-05-10 02:11:00,GALVESTON,GALVESTON,5700 AVE H,,,,LAW ENFORCEMENT FACILITY,,NOT APPLICABLE,SUICIDE,,"HANGING, STRANGULATION",,ASPHYXIATION,"YES, RESULTS ARE AVAILABLE",NOT APPLICABLE,208.0,NOT APPLICABLE,,ATTEMPTED CAPITAL MURDER OF A CHILD,,,FILED,2009-10-14 23:47:00,JAIL,JAIL - SINGLE CELL,601 54TH STREET,GALVESTON,GALVESTON,GALVESTON CO SHERIFFS OFFICE,77550,,,,,
4,PA10085P,0.0,V_2005,2010-05-13 09:02:00,1990-07-18 00:00:00,MILTON,DIXON,W.,,MILTON W. DIXON,1949-11-09,60.0,M,BLACK,2010-04-01,2010-04-01 04:54:00,ANDERSON,PALESTINE,1400 FM 3452,,,,LAW ENFORCEMENT FACILITY,,UNKNOWN,NATURAL CAUSES/ILLNESS,ADENNOCARCINOMA OF DUODENUM (TERMINAL CANCER O...,NOT APPLICABLE,,TERMINAL CANCER,"NO, EVALUATION NOT PLANNED",YES,7197.0,NOT APPLICABLE,,SEXUAL ASLT,,,CONVICTED,1990-07-18 00:00:00,PRISON,TDCJ,"2503 LAKE ROAD, SUITE 5",HUNTSVILLE,STATE,TEXAS DEPT OF CRIMINAL JUSTICE,77340,,,,,


In [29]:
slim = df.copy()
slim['year'] = pd.to_datetime(slim[config['DATE_COL']]).dt.year

In [30]:
slim = slim[config['KEEP_COLS']]
slim.columns = [config.get('RENAMES', {}).get(c, c) for c in slim.columns]
slim.head()

Unnamed: 0,record_id,year,race,sex,manner_of_death,age_at_time_of_death,type_of_custody,death_location_type,means_of_death,death_location_county,agency_name
0,16-89-C,2016,WHITE,M,"OTHER, SPECIFY",44.0,POLICE CUSTODY (PRE-BOOKING),CRIME/ARREST SCENE,FIREARM,GUADALUPE,SEGUIN POLICE DEPT
1,PA10082P,2010,BLACK,M,NATURAL CAUSES/ILLNESS,54.0,PRISON,LAW ENFORCEMENT FACILITY,NOT APPLICABLE,GRIMES,TEXAS DEPT OF CRIMINAL JUSTICE
2,PA10083P,2010,WHITE,M,ACCIDENTAL,43.0,PRISON,LAW ENFORCEMENT FACILITY,NOT APPLICABLE,JEFFERSON,TEXAS DEPT OF CRIMINAL JUSTICE
3,PA10084CJ,2010,WHITE,M,SUICIDE,40.0,JAIL,LAW ENFORCEMENT FACILITY,"HANGING, STRANGULATION",GALVESTON,GALVESTON CO SHERIFFS OFFICE
4,PA10085P,2010,BLACK,M,NATURAL CAUSES/ILLNESS,60.0,PRISON,LAW ENFORCEMENT FACILITY,NOT APPLICABLE,ANDERSON,TEXAS DEPT OF CRIMINAL JUSTICE


In [31]:
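def compress(df, id_col=None):
    """Encode each column of df as integer indices into a sorted lookup list.

    If id_col is given, that column is stored verbatim under
    meta.record_ids instead of being encoded.
    """
    js = {
        'meta': {
            'num_columns': len(df.columns),  # counts id_col too, if present
            'num_records': len(df),
            'lookups': {},
        },
        'records': {},
    }
    if id_col:
        js['meta']['record_ids'] = {
            'field_name': id_col,
            'values': list(df[id_col])
        }
        df = df.drop(id_col, axis=1)
    for col in df.columns:
        # Sorted distinct non-null values; a value's position is its code
        values = sorted(set(df[col].dropna()))
        mapping = {v: i for i, v in enumerate(values)}
        js['meta']['lookups'][col] = values
        # -1 marks a missing value
        js['records'][col] = df[col].apply(lambda x: -1 if pd.isnull(x) else mapping[x]).tolist()

    return js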
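
Beyond eyeballing a sample, the structure returned by `compress` can be checked mechanically. The following is a minimal sketch, assuming the layout shown in "About the compressed file"; `check_compressed` is a hypothetical helper, not something this notebook defines or calls.

```python
def check_compressed(js):
    """Return a list of problems found in a compressed structure (empty if none)."""
    problems = []
    meta = js["meta"]
    n = meta["num_records"]
    lookups = meta["lookups"]

    for col, indices in js["records"].items():
        if col not in lookups:
            problems.append("{}: no lookup list".format(col))
            continue
        if len(indices) != n:
            problems.append(
                "{}: expected {} entries, found {}".format(col, n, len(indices))
            )
        size = len(lookups[col])
        # Valid codes are -1 (missing) or a position in the lookup list
        if any(i < -1 or i >= size for i in indices):
            problems.append("{}: index outside lookup list".format(col))

    ids = meta.get("record_ids")
    if ids is not None and len(ids["values"]) != n:
        problems.append("record_ids: expected {} values".format(n))
    return problems


good = {
    "meta": {"lookups": {"sex": ["F", "M"]}, "num_records": 3, "num_columns": 1},
    "records": {"sex": [1, 0, -1]},
}
bad = {
    "meta": {"lookups": {"sex": ["F", "M"]}, "num_records": 2, "num_columns": 1},
    "records": {"sex": [0, 2]},  # 2 is past the end of the lookup list
}

print(check_compressed(good))
print(check_compressed(bad))
```

In the notebook, a call such as `check_compressed(samp_compressed)` after compressing the sample would be one way to use it.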

## For convenience, we'll inspect a sample of data to see that our code is compressing correctly

In [32]:
samp = slim.sample(5)
samp

Unnamed: 0,record_id,year,race,sex,manner_of_death,age_at_time_of_death,type_of_custody,death_location_type,means_of_death,death_location_county,agency_name
1630,PA05094MJ,2005,WHITE,M,SUICIDE,47.0,JAIL,LAW ENFORCEMENT FACILITY,"HANGING, STRANGULATION",BEXAR,UNIVERSAL CITY POLICE DEPT
5522,16-16-P,2016,BLACK,M,NATURAL CAUSES/ILLNESS,62.0,PRISON,MEDICAL FACILITY,NOT APPLICABLE,GALVESTON,TEXAS DEPT OF CRIMINAL JUSTICE
2721,PA15599C,2015,BLACK,M,"OTHER, SPECIFY",30.0,POLICE CUSTODY (PRE-BOOKING),CRIME/ARREST SCENE,FIREARM,HARRIS,HARRIS CO SHERIFFS OFFICE
447,PA09042P,2009,BLACK,M,NATURAL CAUSES/ILLNESS,53.0,PRISON,LAW ENFORCEMENT FACILITY,NOT APPLICABLE,LIBERTY,TEXAS DEPT OF CRIMINAL JUSTICE
736,PA12053C,2012,WHITE,M,HOMICIDE,52.0,POLICE CUSTODY (PRE-BOOKING),CRIME/ARREST SCENE,FIREARM,DENTON,DALLAS POLICE DEPT


In [33]:
samp_compressed = compress(samp, id_col=config['ID_COL'])
samp_compressed

{'meta': {'lookups': {'age_at_time_of_death': [30.0, 47.0, 52.0, 53.0, 62.0],
   'agency_name': ['DALLAS POLICE DEPT',
    'HARRIS CO SHERIFFS OFFICE',
    'TEXAS DEPT OF CRIMINAL JUSTICE',
    'UNIVERSAL CITY POLICE DEPT'],
   'death_location_county': ['BEXAR',
    'DENTON',
    'GALVESTON',
    'HARRIS',
    'LIBERTY'],
   'death_location_type': ['CRIME/ARREST SCENE',
    'LAW ENFORCEMENT FACILITY',
    'MEDICAL FACILITY'],
   'manner_of_death': ['HOMICIDE',
    'NATURAL CAUSES/ILLNESS',
    'OTHER, SPECIFY',
    'SUICIDE'],
   'means_of_death': ['FIREARM', 'HANGING, STRANGULATION', 'NOT APPLICABLE'],
   'race': ['BLACK', 'WHITE'],
   'sex': ['M'],
   'type_of_custody': ['JAIL', 'POLICE CUSTODY (PRE-BOOKING)', 'PRISON'],
   'year': [2005, 2009, 2012, 2015, 2016]},
  'num_columns': 11,
  'num_records': 5,
  'record_ids': {'field_name': 'record_id',
   'values': ['PA05094MJ', '16-16-P', 'PA15599C', 'PA09042P', 'PA12053C']}},
 'records': {'age_at_time_of_death': [1, 4, 0, 3, 2],
  'agen

---
# Write
---

## For testing, generate a sample compressed file with just a few records (json file)

In [34]:
filename = OUTFOLDER + config['OUTFILE_PREFIX'] + '_compressed_sample.json'
print("Writing sample compressed file to", filename)
with open(filename, 'w') as f:
    f.write(json.dumps(samp_compressed, indent=2))

Writing sample compressed file to ./cdr_compressed_sample.json


## Generate the full compressed dataset (json file)

In [35]:
filename = OUTFOLDER + config['OUTFILE_PREFIX'] + '_compressed.json'
print("Writing full compressed file to", filename)
with open(filename, 'w') as f:
    f.write(json.dumps(compress(slim, id_col=config['ID_COL'])))

Writing full compressed file to ./cdr_compressed.json


## Also write the full, uncompressed CSV file

In [36]:
fullfile = OUTFOLDER + config['OUTFILE_PREFIX'] + '_full.csv'
print("Writing full file to " + fullfile)
df.to_csv(fullfile, index=False)

Writing full file to ./cdr_full.csv
