# Clean the Texas OIS dataset for analysis -- officers shot only

### Latest run covers incidents from 2015-10-06 to 2018-03-25

* Inputs:
   * `OIS.xlsx` (currently local -- TODO add to data.world)
   * `texas_law_enforcement_agencies_and_counties.csv` (from data.world; used to add county information)
* Output: `shot_officers.csv`

##### Author: Everett Wetchler (everett.wetchler@gmail.com)

## 1. Setup and read data

In [1]:
DTW_PROJECT_KEY = 'tji/auxiliary-datasets'
RAW_FILENAME = 'data/raw/OIS.xlsx'
CLEANED_FILENAME = 'data/clean/shot_officers.csv'
AGENCY_COUNTY_DATAFRAME_NAME = 'texas_law_enforcement_agencies_and_counties'

In [2]:
import datadotworld as dw
import numpy as np
import pandas as pd

%load_ext watermark
%watermark -a "Everett Wetchler" -d -t -z -r -g -w -p datadotworld,numpy,pandas

Everett Wetchler 2018-05-07 11:35:53 CDT

datadotworld 1.6.0
numpy 1.14.3
pandas 0.22.0
Git hash: a5099cf65aa91c6e242c3c2d760560fd6152bfe0
Git repo: git@github.com:texas-justice-initiative/data-processing.git
watermark 1.6.0


In [3]:
from lib.standardize_police_agency_names import standardize_agency_name

In [4]:
class OISCleaningError(Exception):
    pass

In [5]:
SHEET_NAME = 'OfficersShot'
shootings = pd.read_excel(RAW_FILENAME, sheet_name=SHEET_NAME)
print("OIS officers-shot incidents from %s to %s" % (
    shootings['Date of Incident'].min().strftime('%Y-%m-%d'),
    shootings['Date of Incident'].max().strftime('%Y-%m-%d')))
shootings.head()


OIS officers-shot incidents from 2015-10-06 to 2018-03-25


Unnamed: 0,No.,Number of Reports Filed,Date OAG Received,Date of Report 1,Name of Agency 1,City of Agency 1,Zip code of Agency 1,Date of Incident,Name of Person 1 Filling out Form,Email Address of Person 1 Filling out Form,...,Non-Officer First Name 2,Non-Officer Last Name 2,Non-Peace Officer's Gender 3,Non-Peace Officer's Age 3,Non-Peace Officer's Race/Ethnicity 3,Non-Officer First Name 3,Non-Officer Last Name 3,Media,Column1,Column2
0,1,1,NaT,10/26/2015,Seguin Police Department,Seguin,78155,2015-10-06,MC Meyers,mcmeyers@seguintexas.gov,...,,,,,,,,,,
1,2,1,NaT,11/3/2015,Bexar County Sheriff's Office,San Antonio,78207,2015-10-30,Det. Frank Stubbs #4048,sstubbs@bexar.org,...,,,,,,,,MySA,News4SanAntonio,
2,3,1,2017-02-21,2/21/2017,Marlin Police Department,Marlin,76661,2015-11-01,Damien Eaglin,chiefeaglin@marlintx.net,...,,,,,,,,Waco Tribune-Herald,,
3,4,1,NaT,12/29/2015,Odessa Police Department,Odessa,79761,2015-12-23,Sgt. David Lara,dlara@odessa-tx.gov,...,,,,,,,,CBS 7,Big Country,
4,5,1,NaT,12/29/2015,Odessa Police Department,Odessa,79761,2015-12-23,Sgt. David Lara,dlara@odessa-tx.gov,...,,,,,,,,CBS 7,Big Country,


## 2. Begin cleaning

### Normalize column names (strip whitespace, lowercase)

In [6]:
shootings.columns = [c.strip().lower() for c in shootings.columns]

### Drop unnecessary columns

In [7]:
shootings.drop(['no.', 'number of reports filed'], axis=1, inplace=True)

### Make column names more machine-friendly

In [8]:
col_renames = {
    "date oag received": "date_ag_received",
    "date of incident": "date_incident",
    "officer first name": "officer_name_first",
    "officer last name": "officer_name_last",
    "injured or deceased's gender": "officer_gender",
    "injured or deceased's age": "officer_age",
    "injured or deceased's race/ethnicity": "officer_race",
    "street address of incident": "incident_address",
    "city of incident": "incident_city",
    "county of incident": "incident_county",
    "zip code of incident": "incident_zip",
    "latitude of incident": "incident_latitude",
    "longitude of incident": "incident_longitude",
    "incident resulted in 1": "incident_result_1",
    "incident resulted in 2": "incident_result_2",
    "media": "media_link_1",
    "column1": "media_link_2",
    "column2": "media_link_3",
}

agency_col_renames = {
    "date of report %d": "agency_%d_report_date",
    "name of agency %d": "agency_%d_name",
    "city of agency %d": "agency_%d_city",
    "zip code of agency %d": "agency_%d_zip",
    "name of person %d filling out form": "agency_%d_name_person_filling_out",
    "email address of person %d filling out form": "agency_%d_email_person_filling_out",
}

for i in range(1, 3):
    for k, v in agency_col_renames.items():
        k = k % i
        v = v % i
        col_renames[k] = v


civilian_col_renames = {
    "non-peace officer's gender %d": "civilian_%d_gender",
    "non-peace officer's age %d": "civilian_%d_age",
    "non-peace officer's race/ethnicity %d": "civilian_%d_race",
    "non-officer first name %d": "civilian_%d_name_first",
    "non-officer last name %d": "civilian_%d_name_last",
}

for i in range(1, 4):
    for k, v in civilian_col_renames.items():
        k = k % i
        v = v % i
        col_renames[k] = v

In [9]:
shootings.columns = [col_renames[c] for c in shootings.columns]
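
To make the template expansion in In [8] concrete, here is a minimal standalone sketch. The helper name `expand_templates` is hypothetical and not part of this notebook; it only illustrates how each `%d` template yields one rename per index.

```python
def expand_templates(templates, indices):
    """Expand '%d'-style key/value templates into concrete rename pairs."""
    expanded = {}
    for i in indices:
        for old, new in templates.items():
            expanded[old % i] = new % i
    return expanded


# Two of the agency templates from the cell above.
agency_templates = {
    "name of agency %d": "agency_%d_name",
    "city of agency %d": "agency_%d_city",
}

# range(1, 3) covers agencies 1 and 2, as in the notebook.
renames = expand_templates(agency_templates, range(1, 3))
for old, new in sorted(renames.items()):
    print("%s -> %s" % (old, new))
```

Because In [9] looks every column up with `col_renames[c]`, a spreadsheet column that is missing from the mapping raises a `KeyError` rather than passing through silently.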

### Translate columns about injury vs death to boolean

In [10]:
def death_injury_officer(s):
    if pd.isnull(s):
        return s
    elif 'death' in s.lower():
        return True
    elif 'injury' in s.lower():
        return False
    raise OISCleaningError('Invalid incident result: ' + s)

def death_injury_civilian(s):
    try:
        return death_injury_officer(s)
    except OISCleaningError:
        # Civilian outcomes may also be recorded as a suicide, which counts as a death.
        if 'suicide' in s.lower():
            return True
        raise

shootings['officer_died'] = shootings['incident_result_1'].apply(death_injury_officer)
shootings['civilian_died'] = shootings['incident_result_2'].apply(death_injury_civilian)
# Match case-insensitively, consistent with the lowercased checks above.
shootings['civilian_suicide'] = shootings['incident_result_2'].str.contains('suicide', case=False)
shootings.drop(['incident_result_1', 'incident_result_2'], axis=1, inplace=True)
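
The mapping rules in this cell can be tried out without pandas. The sketch below is a simplified restatement, not the notebook's code: `died` and `InvalidResultError` are hypothetical names, `None` stands in for a missing value, and the sample strings are made up rather than taken from the spreadsheet.

```python
class InvalidResultError(Exception):
    """Hypothetical stand-in for the notebook's OISCleaningError."""


def died(result, allow_suicide=False):
    """Map a free-text result to True (death), False (injury) or None (missing)."""
    if result is None:
        return None
    text = result.lower()
    if 'death' in text:
        return True
    if 'injury' in text:
        return False
    # Only civilian results are allowed to fall back to 'suicide'.
    if allow_suicide and 'suicide' in text:
        return True
    raise InvalidResultError('Invalid incident result: ' + result)


for value in ['Death', 'Injury', None]:
    print(repr(value), '->', died(value))
print(repr('Suicide'), '->', died('Suicide', allow_suicide=True))
```

Raising on anything unrecognized, instead of returning a default, means a new wording in the source spreadsheet stops the run and has to be handled explicitly.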

### Replace 'Not Available' string values with true missing values

In [11]:
for c in shootings.columns:
    if shootings[c].dtype == 'object':
        # Use .loc to assign in place; chained indexing can write to a copy.
        shootings.loc[shootings[c] == 'Not Available', c] = None
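
An alternative to looping over columns is `DataFrame.replace`, which swaps the placeholder in every column at once. This is a minimal sketch on made-up rows (the frame `demo` is illustrative, not the OIS data), and it is offered as one option rather than the method this notebook uses.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the 'Not Available' placeholder mirrors the source data.
demo = pd.DataFrame({
    'incident_city': ['Seguin', 'Not Available'],
    'officer_age': [54, 42],
})

# Replace the placeholder string with a true missing value in all columns.
cleaned = demo.replace('Not Available', np.nan)
print(cleaned['incident_city'].isnull().tolist())
```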


### Convert "12345-6789" style zips into simply "12345"

In [12]:
for c in shootings.columns:
    if c.endswith('_zip'):
        shootings[c] = shootings[c].apply(lambda z: z if pd.isnull(z) else str(z).split('-')[0])
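
The lambda above is easier to reason about as a named function. The sketch below uses a hypothetical helper, `normalize_zip`, with `None` standing in for a missing value, and made-up inputs.

```python
def normalize_zip(z):
    """Return the part of a ZIP or ZIP+4 value before the hyphen; pass None through."""
    if z is None:
        return None
    return str(z).split('-')[0]


print(normalize_zip('78155-1234'))
print(normalize_zip(78155))
# A float keeps its decimal part when converted with str().
print(normalize_zip(78155.0))
```

The last call is a caveat worth checking for: if a zip column were ever read in as floats, `str()` would produce values like `78155.0` instead of `78155`.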

### Ensure ages are all valid integers

In [13]:
for c in shootings.columns:
    if c.endswith('_age'):
        for age in shootings[c]:
            if pd.notnull(age):
                # Validation only: int() raises on a non-numeric age.
                # The column itself is left unchanged.
                int(age)
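
The check in the cell above calls `int(age)`, which raises for non-numeric text but truncates a float such as `42.5` without complaint. A stricter variant, sketched here with the hypothetical helper `invalid_ages` and made-up values, collects every entry that is not a whole number so they can be reviewed together.

```python
def invalid_ages(values):
    """Return the entries that cannot be interpreted as whole-number ages."""
    bad = []
    for v in values:
        if v is None:
            continue  # missing ages are allowed
        try:
            # Reject values that int() would silently truncate, e.g. 42.5.
            if int(v) != float(v):
                bad.append(v)
        except (TypeError, ValueError):
            bad.append(v)
    return bad


print(invalid_ages([54, '42', 42.5, 'forty', None]))
```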

### Binarize gender columns

In [14]:
def gender_is_male(g):
    if pd.isnull(g):
        return g
    elif g.lower() == 'male':
        return 1
    elif g.lower() == 'female':
        return 0
    raise OISCleaningError('Invalid gender: ' + g)

cols = list(shootings.columns)
for c in cols:
    if c.endswith('_gender'):
        shootings[c + '_male'] = shootings[c].apply(gender_is_male)
        shootings.drop(c, axis=1, inplace=True)

### Convert date columns to ensure validity

In [15]:
for c in shootings.columns:
    if c.startswith('date_') or c.endswith('_date'):
        shootings[c] = pd.to_datetime(shootings[c])

In [16]:
shootings.head()

Unnamed: 0,date_ag_received,agency_1_report_date,agency_1_name,agency_1_city,agency_1_zip,date_incident,agency_1_name_person_filling_out,agency_1_email_person_filling_out,agency_2_report_date,agency_2_name,...,media_link_1,media_link_2,media_link_3,officer_died,civilian_died,civilian_suicide,officer_gender_male,civilian_1_gender_male,civilian_2_gender_male,civilian_3_gender_male
0,NaT,2015-10-26,Seguin Police Department,Seguin,78155,2015-10-06,MC Meyers,mcmeyers@seguintexas.gov,NaT,,...,,,,False,,,0,0.0,,
1,NaT,2015-11-03,Bexar County Sheriff's Office,San Antonio,78207,2015-10-30,Det. Frank Stubbs #4048,sstubbs@bexar.org,NaT,,...,MySA,News4SanAntonio,,False,,,1,1.0,,
2,2017-02-21,2017-02-21,Marlin Police Department,Marlin,76661,2015-11-01,Damien Eaglin,chiefeaglin@marlintx.net,NaT,,...,Waco Tribune-Herald,,,True,,,1,1.0,,
3,NaT,2015-12-29,Odessa Police Department,Odessa,79761,2015-12-23,Sgt. David Lara,dlara@odessa-tx.gov,NaT,,...,CBS 7,Big Country,,False,,,1,1.0,,
4,NaT,2015-12-29,Odessa Police Department,Odessa,79761,2015-12-23,Sgt. David Lara,dlara@odessa-tx.gov,NaT,,...,CBS 7,Big Country,,False,,,1,1.0,,


### Standardize agency names

In [17]:
for c in shootings.columns:
    if c.startswith('agency_') and c.endswith('_name'):
        shootings[c + '_standardized'] = shootings[c].apply(standardize_agency_name)

### Re-order columns more sensibly

In [18]:
section_order = ['date', 'incident', 'officer', 'civilian', 'agency', 'media']
new_col_order = []
for s in section_order:
    this_section_cols = []
    for c in shootings.columns:
        if c.startswith(s + '_'):
            this_section_cols.append(c)
    new_col_order.extend(sorted(this_section_cols))

shape_before = shootings.shape
shootings = shootings[new_col_order]
assert shootings.shape == shape_before, 'Column reordering changed the shape of the data'
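
The reordering keeps only columns whose name starts with one of the section prefixes, which is why the shape assertion matters. The standalone sketch below uses a hypothetical helper, `order_columns`, and a made-up column list that includes one name (`notes`) with no matching prefix.

```python
def order_columns(columns, section_order):
    """Group columns by leading section prefix, sorting within each section."""
    ordered = []
    for section in section_order:
        ordered.extend(sorted(c for c in columns if c.startswith(section + '_')))
    return ordered


columns = ['media_link_1', 'officer_age', 'date_incident',
           'incident_city', 'agency_1_name', 'notes']
sections = ['date', 'incident', 'officer', 'civilian', 'agency', 'media']

ordered = order_columns(columns, sections)
# Columns that match no section prefix are silently left out.
dropped = sorted(set(columns) - set(ordered))
print(ordered)
print(dropped)
```

In the notebook, a column left out this way would change the number of columns, so the assertion on `shootings.shape` would fail and flag the problem.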

In [19]:
shootings.head()

Unnamed: 0,date_ag_received,date_incident,incident_address,incident_city,incident_county,incident_latitude,incident_longitude,incident_zip,officer_age,officer_died,...,agency_2_city,agency_2_email_person_filling_out,agency_2_name,agency_2_name_person_filling_out,agency_2_name_standardized,agency_2_report_date,agency_2_zip,media_link_1,media_link_2,media_link_3
0,NaT,2015-10-06,815 Lamar Drive,Seguin,Guadalupe,29.576609,-97.947667,78155,54,False,...,,,,,,NaT,,,,
1,NaT,2015-10-30,8000 Midcrown,San Antonio,Bexar,29.507478,-98.380029,78218,42,False,...,,,,,,NaT,,MySA,News4SanAntonio,
2,2017-02-21,2015-11-01,301 W. Avenue B,Temple,Bell County,31.095688,-97.344775,76501,42,True,...,,,,,,NaT,,Waco Tribune-Herald,,
3,NaT,2015-12-23,1025 Custer Ave.,Odessa,Ector,31.865143,-102.33585,79761,42,False,...,,,,,,NaT,,CBS 7,Big Country,
4,NaT,2015-12-23,1025 Custer Ave.,Odessa,Ector,31.865143,-102.33585,79761,38,False,...,,,,,,NaT,,CBS 7,Big Country,


## 3. Write

In [20]:
shootings.to_csv(CLEANED_FILENAME, index=False)