# Generate a minimalist, compressed version of CDR/OIS data for our website's explore-the-data app

### NOTE: This is a temporary file, only existing until this code lives in a cron job somewhere

### Purpose of this notebook

This notebook generates the data files that our explore-the-data page uses. There are two for each dataset:
1. A compressed file with just enough data to show the charts, e.g. `cdr_compressed.json`
2. The full CSV file, with all columns, in the same order as the compressed file.

Before running this notebook, edit the first cell to indicate which dataset you want to process, and where you want the resulting files to end up. If you want to process multiple datasets, you'll have to rerun this notebook several times with different settings in the first cell.

### About the compressed file

Say we have a set of records like this:
```
   [
      {"sex": "M", "race": "WHITE", "record_id": "PA111"},
      {"sex": "M", "race": "HISPANIC", "record_id": "PA222"},
      {"sex": "F", "race": "BLACK", "record_id": "PA333"},
      {"sex": "F", "race": null, "record_id": "PA444"},
   ]
```

We will compress them to look like this:
```
    {
      meta: {
        lookups: {
          "sex": ["F", "M"],
          "race": ["HISPANIC", "WHITE", "BLACK"]
        },
        'record_ids': {
            'field_name': 'record_id',
            'values': ["PA111", "PA222", "PA333", "PA444"]
        },
        num_records: 4,
        num_columns: 2
      },
      records: {
        "sex": [1, 1, 0, 0],
        "race": [1, 0, 2, -1]
      },
    }
```
Note that each array in the 'records' object above holds indices into the lookup array for that column. An index of -1 marks a missing value.
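
To make the index scheme concrete, here is a minimal sketch of how a consumer could turn the compressed structure back into plain records. The `decompress` name is hypothetical (it is not part of this notebook or of the website code), and the toy input is the four-record example above.

```python
def decompress(js):
    """Rebuild a list of row dicts from the compressed structure."""
    lookups = js["meta"]["lookups"]
    num_records = js["meta"]["num_records"]
    ids = js["meta"].get("record_ids")  # absent when the dataset has no ID column

    rows = []
    for i in range(num_records):
        row = {}
        if ids is not None:
            row[ids["field_name"]] = ids["values"][i]
        for col, indices in js["records"].items():
            idx = indices[i]
            # -1 marks a missing value; anything else indexes the lookup list
            row[col] = None if idx == -1 else lookups[col][idx]
        rows.append(row)
    return rows


# Toy input mirroring the example above (four records, two columns)
compressed = {
    "meta": {
        "lookups": {"sex": ["F", "M"], "race": ["HISPANIC", "WHITE", "BLACK"]},
        "record_ids": {
            "field_name": "record_id",
            "values": ["PA111", "PA222", "PA333", "PA444"],
        },
        "num_records": 4,
        "num_columns": 2,
    },
    "records": {"sex": [1, 1, 0, 0], "race": [1, 0, 2, -1]},
}

rows = decompress(compressed)
print(rows[0])
print(rows[3])
```

The last row comes back with `race` set to `None`, which is how the -1 sentinel is interpreted in this sketch.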

In practice, this cuts our data size down dramatically, because each column name is stored once and each distinct string value is stored once per column instead of once per record.
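
As a rough way to see the effect, the sketch below compares the JSON size of row-oriented records with the size of the same data in the lookup-plus-indices layout. The data is synthetic (randomly generated, not the real CDR/OIS records), and `encode_column` is a hypothetical pure-Python helper written only for this illustration.

```python
import json
import random

random.seed(0)  # synthetic data only, not the real CDR/OIS records
SEXES = ["F", "M"]
RACES = ["BLACK", "HISPANIC", "WHITE", None]
records = [
    {"sex": random.choice(SEXES), "race": random.choice(RACES)}
    for _ in range(1000)
]


def encode_column(values):
    """Return (lookup, indices), with -1 standing in for missing values."""
    lookup = sorted({v for v in values if v is not None})
    position = {v: i for i, v in enumerate(lookup)}
    return lookup, [-1 if v is None else position[v] for v in values]


compressed = {"meta": {"lookups": {}, "num_records": len(records)}, "records": {}}
for col in ("sex", "race"):
    lookup, indices = encode_column([r[col] for r in records])
    compressed["meta"]["lookups"][col] = lookup
    compressed["records"][col] = indices

raw_size = len(json.dumps(records))
compressed_size = len(json.dumps(compressed))
print("row-oriented bytes:", raw_size)
print("compressed bytes:  ", compressed_size)
```

The exact ratio depends on how many columns there are and how long the repeated strings are, so the numbers printed here say nothing definite about the real datasets.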

##### Author: Everett Wetchler (everett.wetchler@gmail.com)

## \** CONFIGURE THIS CELL **

In [1]:
OUTFOLDER = './'  # Where to write the resulting files
DATASET = 'cdr'  # Key into CONFIGS object below

## Edit this if you want to tweak what data ends up in the compressed file

In [14]:
CONFIGS = {
    'cdr': {
        'DTW_PROJECT_KEY': 'tji/deaths-in-custody',
        'DTW_FILENAME': 'cleaned_custodial_death_reports',
        'OUTFILE_PREFIX': 'cdr',
        'DATE_COL': 'death_date',
        'ID_COL': 'record_id',
        'KEEP_COLS': [
            'record_id', 'year', 'race', 'sex', 'manner_of_death', 'age_at_time_of_death',
            'type_of_custody', 'death_location_type', 'means_of_death', 'death_location_county', 'agency_county'
        ]
    },
    'ois': {
        'DTW_PROJECT_KEY': 'tji/officer-involved-shootings',
        'DTW_FILENAME': 'shot_civilians',
        'OUTFILE_PREFIX': 'ois',
        'DATE_COL': 'date_incident',
        'ID_COL': None,
        'KEEP_COLS': [
            'year', 'civilian_race', 'civilian_gender', 'civilian_age', 'civilian_died',
            'officer_age_1', 'officer_race_1', 'officer_gender_1', 'incident_result_of',
            'incident_county', 'agency_county_1', 'deadly_weapon', 'deadly_weapon_category',
            'multiple_officers_involved'
        ],
        'RENAMES': {
            'officer_gender_1': 'officer_gender',
            'officer_age_1': 'officer_age',
            'officer_race_1': 'officer_race',
            'agency_county_1': 'agency_county',
        }
    }
}

In [28]:
config = CONFIGS[DATASET]

## Import and go

In [2]:
import datadotworld as dw
import numpy as np
import pandas as pd
import simplejson as json

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

%load_ext watermark
%watermark -a 'Everett Wetchler' -d -t -z -w -p numpy,pandas,datadotworld

Everett Wetchler 2018-07-08 14:38:36 CDT

numpy 1.14.5
pandas 0.23.1
datadotworld 1.6.0
watermark 1.6.1


In [29]:
datasets = dw.load_dataset(config['DTW_PROJECT_KEY'], force_update=True)
df = datasets.dataframes[config['DTW_FILENAME']]
df.head()

Unnamed: 0,date_incident,incident_address,incident_city,incident_county,incident_zip,incident_longitude,incident_latitude,incident_census_tract,incident_result_of,incident_call_other,civilian_name_first,civilian_name_last,civilian_name_full,civilian_gender,civilian_age,civilian_race,civilian_died,officer_age_1,officer_race_1,officer_gender_1,agency_name_1,agency_report_date_1,agency_city_1,agency_county_1,agency_zip_1,agency_name_person_filling_out_1,agency_email_person_filling_out_1,officer_age_2,officer_race_2,officer_gender_2,officer_caused_injury_2,agency_name_2,agency_report_date_2,agency_city_2,agency_county_2,agency_zip_2,agency_name_person_filling_out_2,agency_email_person_filling_out_2,officer_age_3,officer_race_3,officer_gender_3,officer_caused_injury_3,agency_name_3,agency_report_date_3,agency_city_3,agency_county_3,agency_zip_3,agency_name_person_filling_out_3,agency_email_person_filling_out_3,officer_age_4,...,agency_name_person_filling_out_7,agency_email_person_filling_out_7,officer_age_8,officer_race_8,officer_gender_8,agency_name_8,agency_report_date_8,agency_city_8,agency_county_8,agency_zip_8,agency_name_person_filling_out_8,agency_email_person_filling_out_8,officer_age_9,officer_race_9,officer_gender_9,agency_name_9,agency_report_date_9,agency_city_9,agency_county_9,agency_zip_9,agency_name_person_filling_out_9,agency_email_person_filling_out_9,officer_age_10,officer_race_10,officer_gender_10,agency_name_10,agency_report_date_10,agency_city_10,agency_county_10,agency_zip_10,agency_name_person_filling_out_10,agency_email_person_filling_out_10,news_coverage_1,news_coverage_2,news_coverage_3,news_coverage_4,cdr_narrative,custodial_death_report,deadly_weapon,deadly_weapon_category,deadly_weapon_description,lea_narrative_published,lea_narrative_shorter,multiple_officers_involved,num_officers_recorded,num_reports_filed,num_rows_about_this_incident,officer_on_duty,time_incident,incident_location
0,2015-09-02,1010 MAGNOLIA STREET,FREEPORT,BRAZORIA,77541.0,-95.356262,28.944891,664400.0,TRAFFIC STOP,NARCOTIC STOP AND EVADING ARREST,RICKEY,MAYBERRY,RICKEY MAYBERRY,M,30.0,BLACK,False,27.0,HISPANIC,M,FREEPORT POLICE DEPT,9/16/2015,FREEPORT,BRAZORIA,77541,PAMELA MORRIS,PMORRIS@FREEPORT.TX.US,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,NaT,,,,,,,,,,NaT,,,,,,,,,,NaT,,,,,,ABC 13,YOUR SOUTHEST TEXAS,,,,False,False,,,,,True,1,1,1,True,,POINT(-95.3562616 28.9448909)
1,2015-09-03,4840 E. PLANO PARKWAY,PLANO,COLLIN,75074.0,-96.642308,33.008128,32010.0,OTHER - SPECIFY TYPE OF CALL,ACCIDENTAL DISCHARGE RICOCHET DURING RANGE ACT...,,,,M,55.0,WHITE,False,30.0,HISPANIC,M,PLANO POLICE DEPT,10/1/2015,PLANO,COLLIN,75074,CURTIS HOWARD,CURTISH@PLANO.GOV,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,NaT,,,,,,,,,,NaT,,,,,,,,,,NaT,,,,,,,,,,,False,True,(DETAILS MISSING),,,,False,1,1,1,True,,POINT(-96.6423077 33.0081284)
2,2015-09-04,101 COUCH CT.,SPRINGTOWN,PARKER,76082.0,-97.634194,32.916724,140409.0,OTHER - SPECIFY TYPE OF CALL,INVESTIGATION OF CRIMINAL ACTIVITY,SULLY JOE,LANIER,SULLY JOE LANIER,M,36.0,WHITE,True,27.0,WHITE,M,PARKER CO SHERIFFS OFFICE,10/6/2015,WEATHERFORD,PARKER,76086,MEREDITH GRAY,MEREDITH.GRAY@PARKERCOUNTYTX.COM,56.0,WHITE,M,,,,,,,,,,,,,,,,,,,,,...,,,,,,,NaT,,,,,,,,,,NaT,,,,,,,,,,NaT,,,,,,WFAA,DFW CBS LOCAL,STAR TELEGRAM,FOX 4 NEWS,DECEDENT SHOT A RIFLE AT LE OFFICERS WHOM RETU...,True,True,FIREARM,FIREARM,,FIRED AT OFFICERS,True,2,1,1,True,,POINT(-97.6341935 32.9167242)
3,2015-09-05,4926 CHENNAULT ROAD,HOUSTON,HARRIS,77033.0,-95.344966,29.681655,332100.0,EMERGENCY CALL OR REQUEST FOR ASSISTANCE,,,,,M,21.0,BLACK,False,28.0,WHITE,M,HOUSTON POLICE DEPT,9/11/2015,HOUSTON,HARRIS,77002,ODON BELMAREZ,ODON.BELMAREZ@HOUSTONPOLICE.ORG,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,NaT,,,,,,,,,,NaT,,,,,,,,,,NaT,,,,,,CHRON,ABC 13,CLICK 2 HOUSTON,,,False,True,FIREARM,FIREARM,AN OFFICER WAS DISPATCHED TO A WEAPONS DISTURB...,,True,1,1,1,True,,POINT(-95.3449662 29.6816553)
4,2015-09-08,1500 RANGE ROAD,"DFW, AIRPORT",TARRANT,75261.0,-97.040335,32.899809,980000.0,OTHER - SPECIFY TYPE OF CALL,TRAINING EXERCISE - RICOCHET FRAGMENTS RESULTI...,,,,M,44.0,WHITE,False,38.0,HISPANIC,M,IRVING POLICE DEPT,10/15/2015,IRVING,DALLAS,75061,MICHAEL COLEMAN,MCOLEMAN@CITYOFIRVING.COM,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,NaT,,,,,,,,,,NaT,,,,,,,,,,NaT,,,,,,,,,,,False,False,,,,,False,1,1,1,True,,POINT(-97.0403352 32.8998091)


In [32]:
slim = df.copy()
slim['year'] = pd.to_datetime(slim[config['DATE_COL']]).dt.year

In [34]:
slim = slim[config['KEEP_COLS']]
# 'RENAMES' is optional (the 'cdr' config has none), so fall back to an empty mapping
slim.columns = [config.get('RENAMES', {}).get(c, c) for c in slim.columns]
slim.head()

Unnamed: 0,year,civilian_race,civilian_gender,civilian_age,civilian_died,officer_age,officer_race,officer_gender,incident_result_of,incident_county,agency_county,deadly_weapon,deadly_weapon_category,multiple_officers_involved
0,2015,BLACK,M,30.0,False,27.0,HISPANIC,M,TRAFFIC STOP,BRAZORIA,BRAZORIA,False,,True
1,2015,WHITE,M,55.0,False,30.0,HISPANIC,M,OTHER - SPECIFY TYPE OF CALL,COLLIN,COLLIN,True,(DETAILS MISSING),False
2,2015,WHITE,M,36.0,True,27.0,WHITE,M,OTHER - SPECIFY TYPE OF CALL,PARKER,PARKER,True,FIREARM,True
3,2015,BLACK,M,21.0,False,28.0,WHITE,M,EMERGENCY CALL OR REQUEST FOR ASSISTANCE,HARRIS,HARRIS,True,FIREARM,True
4,2015,WHITE,M,44.0,False,38.0,HISPANIC,M,OTHER - SPECIFY TYPE OF CALL,TARRANT,DALLAS,False,,False


In [36]:
def compress(df, id_col=None):
    js = {
        'meta': {
            'num_columns': len(df.columns),
            'num_records': len(df),
            'lookups': {},
        },
        'records': {},
    }
    if id_col:
        js['meta']['record_ids'] = {
            'field_name': id_col,
            'values': list(df[id_col])
        }
        df = df.drop(id_col, axis=1)
    for col in df.columns:
        values = sorted(list(set(df[col].dropna())))
        mapping = dict((v, i) for i, v in enumerate(values))
        js['meta']['lookups'][col] = values
        js['records'][col] = df[col].apply(lambda x: -1 if pd.isnull(x) else mapping[x]).tolist()

    return js
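
A quick structural check can catch mistakes before a file is published. The sketch below is a hypothetical helper (`check_compressed` is not part of this notebook) that verifies the invariants implied by the format: every column has a lookup, every index array has `num_records` entries, and every index is either -1 or a valid position in its lookup.

```python
def check_compressed(js):
    """Raise ValueError if the structure breaks the format's invariants."""
    meta = js["meta"]
    n = meta["num_records"]
    if set(js["records"]) != set(meta["lookups"]):
        raise ValueError("records and lookups cover different columns")
    for col, indices in js["records"].items():
        if len(indices) != n:
            raise ValueError(f"{col}: expected {n} indices, got {len(indices)}")
        size = len(meta["lookups"][col])
        if not all(-1 <= i < size for i in indices):
            raise ValueError(f"{col}: index outside lookup range")
    if "record_ids" in meta and len(meta["record_ids"]["values"]) != n:
        raise ValueError("record_ids length does not match num_records")
    return True


# Toy structure in the same shape that compress() produces
sample = {
    "meta": {
        "num_columns": 2,
        "num_records": 4,
        "lookups": {"sex": ["F", "M"], "race": ["BLACK", "HISPANIC", "WHITE"]},
    },
    "records": {"sex": [1, 1, 0, 0], "race": [2, 1, 0, -1]},
}
check_compressed(sample)  # returns True when all checks pass
```

In the notebook this could be called on the result of `compress(...)` before writing, though that wiring is left as an assumption here.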

## For convenience, we'll inspect a sample of data to see that our code is compressing correctly

In [37]:
samp = slim.sample(5)
samp

Unnamed: 0,year,civilian_race,civilian_gender,civilian_age,civilian_died,officer_age,officer_race,officer_gender,incident_result_of,incident_county,agency_county,deadly_weapon,deadly_weapon_category,multiple_officers_involved
34,2015,BLACK,M,28.0,False,41.0,WHITE,M,EMERGENCY CALL OR REQUEST FOR ASSISTANCE,KAUFMAN,KAUFMAN,True,FIREARM,True
325,2017,BLACK,M,26.0,False,34.0,HISPANIC,M,OTHER - SPECIFY TYPE OF CALL,HARRIS,HARRIS,True,FIREARM,False
321,2017,HISPANIC,M,27.0,True,36.0,WHITE,M,OTHER - SPECIFY TYPE OF CALL,HARRIS,HARRIS,True,(DETAILS MISSING),True
427,2018,HISPANIC,M,23.0,True,40.0,HISPANIC,M,"HOSTAGE, BARRICADE, OR OTHER EMERGENCY SITUATION",TRAVIS,TRAVIS,True,FIREARM,True
227,2016,HISPANIC,M,35.0,False,33.0,WHITE,M,OTHER - SPECIFY TYPE OF CALL,ECTOR,ECTOR,True,VEHICLE,False


In [38]:
samp_compressed = compress(samp, id_col=config['ID_COL'])
samp_compressed

{'meta': {'num_columns': 14,
  'num_records': 5,
  'lookups': {'year': [2015, 2016, 2017, 2018],
   'civilian_race': ['BLACK', 'HISPANIC'],
   'civilian_gender': ['M'],
   'civilian_age': [23.0, 26.0, 27.0, 28.0, 35.0],
   'civilian_died': [False, True],
   'officer_age': [33.0, 34.0, 36.0, 40.0, 41.0],
   'officer_race': ['HISPANIC', 'WHITE'],
   'officer_gender': ['M'],
   'incident_result_of': ['EMERGENCY CALL OR REQUEST FOR ASSISTANCE',
    'HOSTAGE, BARRICADE, OR OTHER EMERGENCY SITUATION',
    'OTHER - SPECIFY TYPE OF CALL'],
   'incident_county': ['ECTOR', 'HARRIS', 'KAUFMAN', 'TRAVIS'],
   'agency_county': ['ECTOR', 'HARRIS', 'KAUFMAN', 'TRAVIS'],
   'deadly_weapon': [True],
   'deadly_weapon_category': ['(DETAILS MISSING)', 'FIREARM', 'VEHICLE'],
   'multiple_officers_involved': [False, True]}},
 'records': {'year': [0, 2, 2, 3, 1],
  'civilian_race': [0, 0, 1, 1, 1],
  'civilian_gender': [0, 0, 0, 0, 0],
  'civilian_age': [3, 1, 2, 0, 4],
  'civilian_died': [0, 0, 1, 1, 0],
 

---
# Write
---

## For testing, generate a sample compressed file with just a few records (json file)

In [39]:
filename = OUTFOLDER + config['OUTFILE_PREFIX'] + '_compressed_sample.json'
print("Writing sample compressed file to", filename)
with open(filename, 'w') as f:
    f.write(json.dumps(samp_compressed, indent=2))

Writing sample compressed file to ./ois_compressed_sample.json


## Generate the full compressed dataset (json file)

In [40]:
filename = OUTFOLDER + config['OUTFILE_PREFIX'] + '_compressed.json'
print("Writing full compressed file to", filename)
with open(filename, 'w') as f:
    f.write(json.dumps(compress(slim, id_col=config['ID_COL'])))

Writing full compressed file to ./ois_compressed.json
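
Since file size is the point of this format, it may be worth noting that `json.dumps` by default puts a space after every comma and colon. The sketch below shows the `separators` argument, which removes them. It uses the standard-library `json` module so it runs without extra dependencies; `simplejson.dumps` accepts the same argument. The payload is a toy example, not real output.

```python
import json

payload = {"records": {"sex": [1, 1, 0, 0], "race": [1, 0, 2, -1]}}

default_text = json.dumps(payload)
compact_text = json.dumps(payload, separators=(",", ":"))  # no spaces after , or :

print(len(default_text), len(compact_text))

# The compact form parses back to exactly the same structure
assert json.loads(compact_text) == payload
```

For long index arrays the saving approaches one character per value, which can add up on the full dataset.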


## Also write the full, uncompressed CSV file

In [31]:
# Write alongside the compressed files, honoring OUTFOLDER like the cells above
fullfile = OUTFOLDER + config['OUTFILE_PREFIX'] + '_full.csv'
print("Writing full file to " + fullfile)
df.to_csv(fullfile, index=False)

Writing full file to ./ois_full.csv
