# File processing script

This script needs to be run before any other notebook. It takes an export from our data store and joins in lookup tables, based on the model definitions, that add pretty names for coded values such as race and manner of death.

It's set to pull the latest export of the data.
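
The path to the export is written out by hand further down, so it has to be updated whenever a new export arrives. As a rough sketch, the newest file could instead be picked by the date in its name. This assumes every export follows the `Death-YYYY-MM-DD.csv` pattern seen in `../data_raw/`; the names `latest_export` and `latest_export_path` are hypothetical helpers, not part of this notebook.

```python
import re
from pathlib import Path

# Matches export names such as "Death-2017-06-13.csv" and captures the date.
EXPORT_PATTERN = re.compile(r'^Death-(\d{4}-\d{2}-\d{2})\.csv$')


def latest_export(filenames):
    """Return the Death export with the most recent date in its name, or None."""
    dated = []
    for name in filenames:
        match = EXPORT_PATTERN.match(name)
        if match:
            # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
            dated.append((match.group(1), name))
    return max(dated)[1] if dated else None


def latest_export_path(directory='../data_raw'):
    """Hypothetical helper: return the path of the newest export in `directory`."""
    folder = Path(directory)
    if not folder.is_dir():
        return None
    newest = latest_export(p.name for p in folder.iterdir())
    return str(folder / newest) if newest else None
```

If this were adopted, `raw_file = latest_export_path()` could stand in for the hard-coded path, at the cost of making it less obvious which export a given run used.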

In [1]:
import agate
import warnings
warnings.filterwarnings('ignore')

## Get the data

In [2]:
ls ../data_raw/

Death-2017-06-13.csv*  offense_table.csv*
agency_table.csv*      offenses_latest.csv*


In [3]:
# path to the most recent export of the Deaths model from the admin 
raw_file = '../data_raw/Death-2017-06-13.csv'

# sets data types on fields agate got wrong
specified_data_types = {
    'tracked_cause': agate.Text(),
    'offense': agate.Text(),
    'case_study': agate.Text(),
    'official_discipline': agate.Text()
}

# read in the data
deaths = agate.Table.from_csv(raw_file, column_types=specified_data_types)

# print what we got
print(deaths)

| column               | data_type |
| -------------------- | --------- |
| id                   | Number    |
| ag_report_url        | Text      |
| first_name           | Text      |
| middle_name          | Text      |
| last_name            | Text      |
| suffix               | Text      |
| slug                 | Text      |
| race                 | Text      |
| gender               | Text      |
| date_of_birth        | Date      |
| date_of_death        | Date      |
| age                  | Number    |
| agency               | Number    |
| restrained           | Boolean   |
| tazed                | Boolean   |
| times_tazed          | Number    |
| pepper_sprayed       | Boolean   |
| official_discipline  | Text      |
| grand_jury_result    | Text      |
| mental_health_issues | Boolean   |
| manner_of_death      | Text      |
| drug_intoxication    | Boolean   |
| cause_of_death       | Text      |
| tracked_cause        | Text      |
| offense              | Text      |

### Sets up variable names

The original deaths table stores several fields as single-letter codes, so lookup tables have to be joined to it to add pretty names.

Here I amend the same table over and over in a series of joins. This may not be ideal (i.e., [non-Groskopf-esque](http://agate.readthedocs.io/en/1.6.0/about.html#principles)), but I'm at least putting them all together so they can be managed together.

In [4]:
# tuple of choices matches models.py in the data warehouse
RACE = (
    ('w', 'White'),
    ('b', 'Black'),
    ('h', 'Hispanic/Latino'),
    ('a', 'Asian'),
    ('n', 'Native American/Pacific Islander'),
    ('o', 'Other'),
    ('u', 'Unknown'),
)

# set columns and make table
race_column_names = ['letter', 'race_name']
race_values = agate.Table(RACE, race_column_names)

# joins race_values table to deaths to get pretty race names
deaths = deaths.join(race_values, 'race', 'letter')

MANNER = (
    ('n', 'Natural'),
    ('a', 'Accident'),
    ('h', 'Homicide'),
    ('s', 'Suicide'),
    ('u', 'Undetermined'),
)

# set columns, makes table, makes join
manner_column_names = ['letter', 'manner_name']
manner_values = agate.Table(MANNER, manner_column_names)
deaths = deaths.join(manner_values, 'manner_of_death', 'letter')

GRANDJURY = (
    ('i', 'Indictment'),
    ('n', 'No-bill'),
    ('b', 'Not brought')
)

# set columns, makes table, makes join
grandjury_column_names = ['letter', 'grand_jury_name']
grandjury_values = agate.Table(GRANDJURY, grandjury_column_names)
deaths = deaths.join(grandjury_values, 'grand_jury_result', 'letter')

DISCIPLINE = (
    ('f', 'Fired'),
    ('s', 'Suspended'),
    ('n', 'None'),
    ('r', 'No reply')
)

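# set columns, makes table, makes join
# agate's Text type casts the string 'None' to null by default,
# so cast_nulls=False is needed to keep the 'None' label
discipline_column_names = ['letter', 'discipline_name']
discipline_column_types = [agate.Text(), agate.Text(cast_nulls=False)]
discipline_values = agate.Table(DISCIPLINE, discipline_column_names, discipline_column_types)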
deaths = deaths.join(discipline_values, 'official_discipline', 'letter')
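
Because agate's `join` is a left outer join by default, a code in the deaths table that is missing from its lookup tuple does not raise an error; the row simply gets an empty pretty name. A small check like the one below could catch that before exporting. It is only a sketch in plain Python: `unmapped_codes` is a hypothetical helper and the sample values are made up for illustration.

```python
def unmapped_codes(values, choices):
    """Return the sorted codes found in `values` that have no entry in `choices`."""
    known = {code for code, _name in choices}
    # Empty cells (None) are skipped: they have no code to translate.
    return sorted({v for v in values if v is not None and v not in known})


GRANDJURY = (
    ('i', 'Indictment'),
    ('n', 'No-bill'),
    ('b', 'Not brought'),
)

# Made-up sample standing in for a column of the deaths table.
sample = ['i', 'n', None, 'x', 'b', 'x']
print(unmapped_codes(sample, GRANDJURY))  # ['x']
```

In the notebook, the values would come from the table itself, for example `deaths.columns['grand_jury_result'].values()`. An empty list means every code in that column has a pretty name.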

In [5]:
deaths.to_csv('../exports/deaths_latest.csv')