In [1]:
import pandas as pd
import numpy as np
import preprocess # preprocessing module
pd.set_option('display.precision', 2)  # full option name; the short form 'precision' can be ambiguous in newer pandas

%load_ext watermark
%load_ext autoreload
%autoreload 2

%watermark -d -t -u -v -g -r -b -iv -a "Hongsup Shin" 

numpy  1.18.1
pandas 1.0.3
Hongsup Shin 
last updated: 2020-08-23 09:45:54 

CPython 3.6.10
IPython 7.13.0
Git hash: 5b49c05d03a0c78bc736f8bfce42aeff65c6ee50
Git repo: https://github.com/texas-justice-initiative/officer_involved_shooting_report_2020.git
Git branch: master


# Preprocessing the Officer-Involved Shooting Data

This notebook describes the preprocessing of the officer-involved shooting data from the TJI website: the civilian dataset (`tji_civiliansShot.csv`) and the officer dataset (`tji_officersShot.csv`). Both datasets were downloaded from the website in June 2020.

This notebook documents the data preprocessing for TJI's **"Officer-involved Shootings in Texas Report"**, published in August 2020. The preprocessed data (the end product of this notebook) is used to create the Data Summary and Data Insight sections of the report. For the Data Summary section, see `1.0-hs-data_summary_shooting.ipynb`. For the Data Insight section, see `1.0-hs-analyze_shooting_data_final_v0.1.ipynb`.

In [2]:
df_cd = pd.read_csv('../Data/Raw/Website/tji_civiliansShot.csv'); print(df_cd.shape)
df_os = pd.read_csv('../Data/Raw/Website/tji_officersShot.csv'); print(df_os.shape)

(791, 143)
(137, 47)


# Civilian dataset

# 1. Convert the date column formats

The date columns in both datasets are stored with the `object` dtype (strings). We convert them to the `datetime64[ns]` format for convenience.

In [3]:
print(df_cd['date_incident'][:5])
print(df_cd['date_incident'].dtypes)

0    2015-09-02
1    2015-09-03
2    2015-09-04
3    2015-09-05
4    2015-09-08
Name: date_incident, dtype: object
object


In [4]:
df_cd = preprocess.convert_date_cols(df_cd, 'date')
df_os = preprocess.convert_date_cols(df_os, 'date')
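
The `preprocess` module is not shown in this notebook. As a rough illustration only, a helper like `convert_date_cols` could be written as below. This is a minimal sketch under my own assumptions (the function name `convert_date_cols_sketch`, the use of `errors='coerce'`, and the toy rows are all hypothetical), not the actual implementation.

```python
import pandas as pd

def convert_date_cols_sketch(df, keyword):
    """Hypothetical stand-in for preprocess.convert_date_cols (not the real code)."""
    out = df.copy()
    for col in out.columns:
        if keyword in col:
            # errors='coerce' turns unparseable entries into NaT instead of raising
            out[col] = pd.to_datetime(out[col], errors='coerce')
    return out

# Toy rows, made up for illustration
toy = pd.DataFrame({
    'date_incident': ['2015-09-02', '2015-09-03', 'not a date'],
    'incident_county': ['TRAVIS', 'HARRIS', 'BEXAR'],
})
converted = convert_date_cols_sketch(toy, 'date')
print(converted.dtypes)
```

Only columns whose name contains the keyword are touched, which matches how the date columns are selected in the next cell.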

In [5]:
cols_date_cd = [col for col in df_cd.columns if 'date' in col]
cols_date_os = [col for col in df_os.columns if 'date' in col]
print(cols_date_cd)
print(cols_date_os)

['date_ag_received', 'date_incident', 'agency_report_date_1', 'agency_report_date_2', 'agency_report_date_3', 'agency_report_date_4', 'agency_report_date_5', 'agency_report_date_6', 'agency_report_date_7', 'agency_report_date_8', 'agency_report_date_9', 'agency_report_date_10', 'agency_report_date_11']
['date_ag_received', 'date_incident', 'agency_report_date_1', 'agency_report_date_2']


In [6]:
print(df_cd[cols_date_cd].dtypes)
print(df_os[cols_date_os].dtypes)

date_ag_received         datetime64[ns]
date_incident            datetime64[ns]
agency_report_date_1     datetime64[ns]
agency_report_date_2     datetime64[ns]
agency_report_date_3     datetime64[ns]
agency_report_date_4     datetime64[ns]
agency_report_date_5     datetime64[ns]
agency_report_date_6     datetime64[ns]
agency_report_date_7     datetime64[ns]
agency_report_date_8     datetime64[ns]
agency_report_date_9     datetime64[ns]
agency_report_date_10    datetime64[ns]
agency_report_date_11    datetime64[ns]
dtype: object
date_ag_received        datetime64[ns]
date_incident           datetime64[ns]
agency_report_date_1    datetime64[ns]
agency_report_date_2    datetime64[ns]
dtype: object


We can add year and month columns.

In [7]:
df_cd['year'] = df_cd['date_incident'].dt.year
df_cd['month'] = df_cd['date_incident'].dt.month

# 2. Correct the county names
Based on [the census data](https://demographics.texas.gov/Data/TPEPP/Estimates/), we can correct the county names.

In [8]:
df_census = pd.read_pickle('../Data/Interim/census_county_race_2010.pkl'); df_census.head()

County,WHITE,BLACK,HISPANIC,OTHER
ANDERSON,34383,12472,10550,1574
ANDREWS,6794,206,11371,307
ANGELINA,55069,13751,20476,2391
ARANSAS,15716,244,6896,868
ARCHER,8238,40,933,248


In [9]:
print(set(df_cd['incident_county']) - set(df_census.index))
print(set(df_os['incident_county']) - set(df_census.index))

{'QUAY (NM)', 'COLIN'}
set()


One incident took place in Quay County, New Mexico.
- This incident began when an officer tried to stop a driver in Amarillo. The driver evaded, and officers from a few agencies pursued the vehicle. Just as the driver crossed state lines, he got into a shootout with officers. A Texas DPS trooper in Quay County, New Mexico, shot the driver.
- Since this is only a single incident, we are going to keep it for now.

In [10]:
(df_cd['incident_county'] == 'QUAY (NM)').sum()

1

We can also correct the typo:

In [11]:
df_cd['incident_county'] = df_cd['incident_county'].str.replace('COLIN','COLLIN')

# 3. Remove duplicates
The dataset does not have a unique identifier, so I used the civilian's full name and the date of the incident to identify duplicates.
I found 3 incidents that have duplicates. These seem to be reports that were filed twice. I kept the most recent report for each incident and deleted the rows with earlier reports, assuming that the more recent reports have more up-to-date information and fewer errors.

In [12]:
df_cd_unique, df_cd_duplicates = preprocess.get_duplicates_from_cols(df_cd, ['civilian_name_full', 'date_incident'], what_to_keep='first'); df_cd_duplicates

Unnamed: 0,civilian_name_full,date_incident
503,SHAWN BOYETT,2018-07-18
508,JAVIER LOPEZ,2018-07-20
598,MATTHEW REYES MIRELES,2019-01-25
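
`preprocess.get_duplicates_from_cols` is not shown here. The following is a minimal sketch of how such a split could be done with `DataFrame.duplicated`; the helper name `split_duplicates_sketch`, the `report_version` column, and the toy rows are hypothetical.

```python
import pandas as pd

def split_duplicates_sketch(df, cols, keep='first'):
    """Hypothetical helper: returns (deduplicated rows, rows flagged as duplicates)."""
    is_dup = df.duplicated(subset=cols, keep=keep)
    return df.loc[~is_dup], df.loc[is_dup, cols]

# Toy rows, made up for illustration
toy = pd.DataFrame({
    'civilian_name_full': ['PERSON A', 'PERSON B', 'PERSON A'],
    'date_incident': pd.to_datetime(['2018-07-18', '2018-07-20', '2018-07-18']),
    'report_version': [1, 1, 2],
})
unique, dropped = split_duplicates_sketch(toy, ['civilian_name_full', 'date_incident'])
print(unique.index.tolist())   # rows kept
print(dropped.index.tolist())  # rows flagged as duplicates
```

With `keep='first'`, pandas keeps the first occurrence in row order, so which report survives depends on how the rows are sorted. If the real helper behaves the same way, it may be worth confirming that the sort order puts the most recent report first.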


No duplicates are found in the officer data.

In [13]:
df_os_unique, df_os_duplicates = preprocess.get_duplicates_from_cols(df_os, ['officer_name_first', 'officer_name_last', 'date_incident'], what_to_keep='first')
print(df_os_duplicates.shape[0])

0


In [14]:
df_cd = df_cd_unique.copy()
df_os = df_os_unique.copy()
print(df_cd.shape)
print(df_os.shape)

(788, 145)
(137, 47)


# 4. Select data from 2016 to 2019

We limit the analysis to 2016 through 2019 because the 2015 data is incomplete (data collection started in September 2015). We therefore slice the data by the year in which the incident happened.

In [15]:
years = range(2016, 2020)

In [16]:
df_cd = df_cd.loc[df_cd['year'].isin(years)].copy(); print(df_cd.shape)  # .copy() avoids SettingWithCopyWarning when columns are added later

(697, 145)


# 5. Create a boolean column to represent the severity of an incident (death or injury)

The analyses in the report look at all incidents and also at the subset of incidents in which civilians or officers were shot and killed. For convenience, we create a boolean column that represents the severity (whether the person shot was killed or not).

In [17]:
df_cd['died'] = df_cd['civilian_died']=='DEATH'

# 6. Clean the text from `incident_result_of` in the civilian shooting data
The column that records the cause of an incident, `incident_result_of`, contains many text variations, but they can be categorized into 5 groups: 'Traffic Stop', 'Emergency/Request for Assistance', 'Execution of a Warrant', 'Hostage/Barricade/Other Emergency', 'Other'. Thus, cleaning is required.

In [18]:
np.sort(df_cd['incident_result_of'].unique())

array(['EMERGENCY', 'EMERGENCY CALL OR REQUEST FOR ASSISTANCE',
       'EMERGENCY CALL OR REQUEST FOR ASSISTANCE, TRAFFIC STOP',
       'EMERGENCY CALL OR REQUEST FOR ASSISTANCE; EXECUTION OF A WARRANT',
       'EMERGENCY CALL OR REQUEST FOR ASSISTANCE; HOSTAGE, BARRICADE, OR OTHER EMERGENCY SITUATION',
       'EMERGENCY CALL OR REQUEST FOR ASSISTANCE; HOSTAGE/BARRICADE/OTHER EMERG SITUATION',
       'EMERGENCY CALL OR REQUEST FOR ASSISTANCE; OTHER',
       'EMERGENCY CALL OR REQUEST FOR ASSISTANCE; TRAFFIC STOP',
       'EMERGENCY CALL OR REQUEST FOR ASSISTANCE; TRAFFIC STOP; EXECUTION OF A WARRANT',
       'EXECUTION OF A WARRANT', 'EXECUTION OF A WARRANT; OTHER',
       'HOSTAGE, BARRICADE OR OTHER EMERGENCY SITUATION',
       'HOSTAGE, BARRICADE, OR OTHER EMERGENCY SITUATION',
       'HOSTAGE, BARRICADE, OR OTHER EMERGENCY SITUATION; OTHER - SPECIFY TYPE OF CALL',
       'HOSTAGE/BARRICADE/OTHER EMERG SITUATION',
       'HOSTAGE/BARRICADE/OTHER EMERGENCY SITUATION', 'OTHER',
      

I followed these steps for cleaning:
- I converted `EMERGENCY` to `EMERGENCY CALL OR REQUEST FOR ASSISTANCE`
- I converted `EMERGENCY CALL OR REQUEST FOR ASSISTANCE, TRAFFIC STOP` to `EMERGENCY CALL OR REQUEST FOR ASSISTANCE; TRAFFIC STOP`
- I mapped the remaining variants to consistent category names (see `clean_incident_causes()`)

In [19]:
df_cd.loc[df_cd['incident_result_of']=='EMERGENCY', 'incident_result_of'] = 'EMERGENCY CALL OR REQUEST FOR ASSISTANCE'
df_cd.loc[df_cd['incident_result_of']=='EMERGENCY CALL OR REQUEST FOR ASSISTANCE, TRAFFIC STOP', 'incident_result_of'] = \
'EMERGENCY CALL OR REQUEST FOR ASSISTANCE; TRAFFIC STOP'
df_cd['incident_result_of'] = df_cd['incident_result_of'].str.strip()

In [20]:
def clean_incident_causes(s):
    if 'EMERGENCY' in s:
        return 'Emergency/Request for Assistance'
    elif 'HOSTAGE' in s:
        return 'Hostage/Barricade/Other Emergency'
    elif 'OTHER' in s:
        return 'Other'
    elif 'TRAFFIC STOP' in s:
        return 'Traffic Stop'
    elif 'WARRANT' in s:
        return 'Execution of a Warrant'
    else:
        raise ValueError('Double check the string from incident causes.')
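
One thing worth double-checking in `clean_incident_causes()` is the branch order. The `'EMERGENCY'` test runs first, and some hostage strings in the list above (for example `HOSTAGE, BARRICADE, OR OTHER EMERGENCY SITUATION`) also contain that word, so they fall into the emergency category. Whether that is intended is not stated here. The sketch below is a hypothetical variant (`clean_incident_causes_hostage_first` is a name introduced only for this illustration) that tests `'HOSTAGE'` first. Using it would change the counts reported later in this notebook.

```python
def clean_incident_causes_hostage_first(s):
    """Hypothetical variant: same branches, but 'HOSTAGE' is tested before 'EMERGENCY'."""
    if 'HOSTAGE' in s:
        return 'Hostage/Barricade/Other Emergency'
    elif 'EMERGENCY' in s:
        return 'Emergency/Request for Assistance'
    elif 'OTHER' in s:
        return 'Other'
    elif 'TRAFFIC STOP' in s:
        return 'Traffic Stop'
    elif 'WARRANT' in s:
        return 'Execution of a Warrant'
    else:
        raise ValueError('Double check the string from incident causes.')

raw = 'HOSTAGE, BARRICADE, OR OTHER EMERGENCY SITUATION'
print('EMERGENCY' in raw)  # this substring test is what the original first branch checks
print(clean_incident_causes_hostage_first(raw))
```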

In [21]:
incident_causes_list = ['Traffic Stop', 'Emergency/Request for Assistance', 
                        'Execution of a Warrant', 'Hostage/Barricade/Other Emergency', 'Other']

Even though this is a single column, an incident can list multiple causes, as the values above show. I split these into separate columns and one-hot encoded them (one binary column per cause).

In [22]:
df_causes_list = df_cd['incident_result_of'].str.split(';')
df_causes_list_clean = df_causes_list.apply(lambda x: [clean_incident_causes(s) for s in x]).apply(pd.Series)
df_causes_list_clean_separated = df_causes_list_clean.stack().str.get_dummies().sum(level=0)[incident_causes_list]
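
The `sum(level=0)` call above works with the pandas version listed at the top of this notebook (1.0.3), but newer pandas releases no longer accept the `level` argument. As a sketch only, the same one-hot encoding can be written with `explode` and `groupby(level=0)`; the toy series below is hypothetical and stands in for the cleaned cause lists.

```python
import pandas as pd

# Toy stand-in for the cleaned cause lists (made-up rows)
causes = pd.Series([
    ['Traffic Stop'],
    ['Emergency/Request for Assistance', 'Other'],
    ['Other'],
], index=[66, 67, 68])

# One row per (incident, cause), then indicator columns, then collapse back per incident
one_hot = (
    pd.get_dummies(causes.explode())
      .groupby(level=0)
      .sum()
      .astype(int)
)
print(one_hot)
```

Each incident keeps its original index, so the result can be concatenated back onto the main dataframe in the same way as before.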

In [23]:
df_causes_list_clean_separated.sum()

Traffic Stop                          88
Emergency/Request for Assistance     386
Execution of a Warrant                71
Hostage/Barricade/Other Emergency     37
Other                                176
dtype: int64

In [24]:
df_cd = pd.concat([df_cd, df_causes_list_clean_separated], axis=1); print(df_cd.shape)
df_cd[incident_causes_list].head()

(697, 151)


Unnamed: 0,Traffic Stop,Emergency/Request for Assistance,Execution of a Warrant,Hostage/Barricade/Other Emergency,Other
66,0,1,0,0,0
67,0,0,0,0,1
68,0,1,0,0,0
69,0,1,0,0,0
70,0,0,0,0,1


# 7. Discretize civilian age into bins based on the census age groups

I found these mortality datasets from Texas DSHS:
- [Death counts by race, gender and age, 2013](https://www.dshs.texas.gov/chs/vstat/vs13/t26a.aspx)
    - All rates are per 100k population.
    - This dataset has race demographics (white, black, hispanic) and gender demographics (male, female).
    - I focused only on the **male** data, mostly because the civilian dataset has very few female civilians.
- [Death counts by county, 2013](https://www.dshs.texas.gov/chs/vstat/vs13/t26b.aspx)
    - All rates are per 1k population.
    - This dataset has race demographics (white, black, hispanic, other) but does not have gender distinction.

This census data bins age into the following groups: '<1', '1-4', '5-14', '15-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75+'. I decided to use the same binning so that our data can be compared with it. To do so, I extracted the male mortality data (the age analysis only looks at men because the female sample is too small for a meaningful comparison), removed the newborns (younger than age 1), and binned the ages in the civilian data.

In [25]:
df_death_age_male = pd.read_csv('../Data/Raw/Census/mortality_rate_by_age_male.csv', index_col='Age').iloc[1:, :]; print(df_death_age_male.shape)

(9, 4)


In [26]:
df_death_age_male.head()

Age,WHITE,BLACK,HISPANIC,TOTAL
1-4,95,47,103,245
5-14,108,57,137,302
15-24,811,297,700,1808
25-34,1210,505,917,2632
35-44,1755,628,1199,3582


In [27]:
age_range_names = df_death_age_male.index.values
print(age_range_names)
print(len(age_range_names))

['1-4' '5-14' '15-24' '25-34' '35-44' '45-54' '55-64' '65-74' '75+']
9


In [28]:
bins = [5, 15, 25, 35, 45, 55, 65, 75, 100]
df_cd['civilian_age_binned'] = np.digitize(df_cd['civilian_age'], bins)

In [29]:
df_cd['civilian_age_binned'].value_counts().sort_index()

0      1
1      3
2    167
3    217
4    148
5     93
6     52
7      8
8      4
9      4
Name: civilian_age_binned, dtype: int64

Note that even though we have 9 age ranges (`age_range_names`), we now have 10 indices (0 to 9). This is because the last index (`9`) represents data that cannot be placed in any of the defined ranges: it is either missing or out of range (in this case, all of it was missing data).
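
To make the index-to-label mapping concrete, here is a small sketch with made-up ages. The bin edges and labels are copied from the cells above; the `ages` array and the `'missing/out of range'` label are assumptions for illustration.

```python
import numpy as np

bins = [5, 15, 25, 35, 45, 55, 65, 75, 100]
age_range_names = ['1-4', '5-14', '15-24', '25-34', '35-44',
                   '45-54', '55-64', '65-74', '75+']

ages = np.array([3, 5, 14, 15, 80, np.nan])  # hypothetical ages
idx = np.digitize(ages, bins)
print(idx)

# Indices 0-8 map onto the census labels; index 9 (== len(bins)) is the catch-all
labels = [age_range_names[i] if i < len(age_range_names) else 'missing/out of range'
          for i in idx]
print(labels)
```

By default `np.digitize` treats each bin as closed on the left, so an age of exactly 15 lands in '15-24' rather than '5-14'.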

# 8. Calculate the report delay
I created a column that represents the report delay: the number of days between the incident and the date that the agency received its report, i.e. `date_ag_received - date_incident`. Agencies are required by law to file within 30 days.

In [30]:
df_cd['date_ag_received'].isnull().sum()

166

In [31]:
df_cd.loc[df_cd['date_ag_received'].isnull(), 'year'].value_counts()

2016    160
2017      6
Name: year, dtype: int64

Most of the missing data comes from 2016 and this is why I excluded 2016 from the report-delay analysis.

In [32]:
df_cd['delay_days'] = (df_cd['date_ag_received'] - df_cd['date_incident']).dt.days

In [33]:
df_cd['delay_days'].value_counts().sort_index()

-364.0     1
 0.0      11
 1.0      26
 2.0      35
 3.0      29
          ..
 340.0     1
 365.0     1
 386.0     1
 390.0     1
 738.0     1
Name: delay_days, Length: 80, dtype: int64

A negative delay does not make sense, so let's convert it to NaN.

In [34]:
df_cd.loc[df_cd['delay_days']<0, 'delay_days'] = np.nan
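
The two steps above (subtracting the dates, then blanking out negative values) can be seen end to end on a toy frame. The rows below are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows: a normal delay, a missing received date, and a received date before the incident
toy = pd.DataFrame({
    'date_incident':    pd.to_datetime(['2018-01-10', '2018-03-01', '2018-05-05']),
    'date_ag_received': pd.to_datetime(['2018-01-12', None, '2017-05-06']),
})
toy['delay_days'] = (toy['date_ag_received'] - toy['date_incident']).dt.days
# Negative delays do not make sense, so they are replaced with NaN
toy.loc[toy['delay_days'] < 0, 'delay_days'] = np.nan
print(toy['delay_days'].tolist())
```

A missing `date_ag_received` propagates as NaN through the subtraction, which is why the delay column is a float column rather than an integer one.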

# 9. Bin the report delays

For visualization purposes, I discretized the delay into bins.

In [35]:
bins = [0, 7, 14, 30, 60, 90, 180, 360, 720]
binnames = ['Same Day'] + ['{} to {} Days'.format(bins[i]+1, bins[i+1]) for i in range(len(bins)-1)] + ['More than 720 Days']
print(binnames)

['Same Day', '1 to 7 Days', '8 to 14 Days', '15 to 30 Days', '31 to 60 Days', '61 to 90 Days', '91 to 180 Days', '181 to 360 Days', '361 to 720 Days', 'More than 720 Days']


NaNs get the label `-1`.

In [36]:
delay_bins = np.digitize(df_cd['delay_days'].values, bins, right=True)
nan_inds = np.argwhere(pd.isnull(df_cd['delay_days']).values).ravel()
delay_bins[nan_inds] = -1
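
The reason for the explicit `-1` relabeling is easier to see on a small example. In the sketch below the bin edges are the ones defined above, while the `delays` values are hypothetical.

```python
import numpy as np

bins = [0, 7, 14, 30, 60, 90, 180, 360, 720]
delays = np.array([0, 1, 7, 8, 30, 31, 738, np.nan])  # hypothetical delays in days

raw_idx = np.digitize(delays, bins, right=True)
print(raw_idx)  # at this point NaN shares an index with delays above 720 days

idx = raw_idx.copy()
idx[np.isnan(delays)] = -1  # give missing delays their own label
print(idx)
```

With `right=True` each bin includes its upper edge, so a delay of 7 days falls in '1 to 7 Days' and a delay of 0 gets the 'Same Day' index.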

In [37]:
print(delay_bins)

[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1  7 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1  7 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  7 -1 -1  7 -1
 -1 -1 -1 -1 -1 -1  6 -1 -1 -1 -1  6  6 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  6
  6 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  2 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  5 -1 -1  5  3 -1 -1 -1 -1  4
  4  4 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  3  3  1  3  1  2  1  3  6  1
  3  3  1  1  1  1  2  2  3  3  2  1  3  3  3  2  1  8  1  2  6  1  2  1
  3  2  1  3  3  1  1  0  2  1  2  3  4  1  3  0  2  3  2  3  2  2  2  2
  1  3  2  1  2  4  2  3  1  3  3  3  1  1  1  1  3  2  4  1  2  2  2  1
  0  9  1  1  3  1  4  3  2  2  4  3  3  2  3  1  1  3  3  3  2  3  2  3
  3  3  3  1  3  1  3  4  5  2  6  4  3  2  3  3  3  1  3  3  3  3  1  1
  1  1  1  2  3  1  1  3  3  3  2  3  4  1  2  4  6

In [38]:
df_cd['delay_bin_label'] = delay_bins

# 10. Save the preprocessed data

In [39]:
df_cd.to_pickle('../Data/Preprocessed/civilian_shooting_preprocessed.pkl') # pickle preserves the datetime format 
df_cd.to_csv('../Data/Preprocessed/civilian_shooting_preprocessed.csv') # csv does not
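
The comments above note that the CSV copy loses the datetime format. A minimal sketch of what that means when reading the file back, using an in-memory buffer and a made-up frame instead of the real files:

```python
import io
import pandas as pd

toy = pd.DataFrame({'date_incident': pd.to_datetime(['2016-01-05', '2019-12-31'])})

buf = io.StringIO()
toy.to_csv(buf, index=False)

buf.seek(0)
plain = pd.read_csv(buf)                                  # dates come back as strings
buf.seek(0)
parsed = pd.read_csv(buf, parse_dates=['date_incident'])  # dates restored on load

print(plain['date_incident'].dtype)
print(parsed['date_incident'].dtype)
```

Anyone loading the CSV version would need to pass `parse_dates` (or call `pd.to_datetime` afterwards) for every date column, whereas the pickle can be used as is.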

# Officer dataset

# 1. Extract year and month

In [40]:
df_os['year'] = df_os['date_incident'].dt.year
df_os['month'] = df_os['date_incident'].dt.month

Select data from 2016 to 2019, as in the civilian dataset.

In [41]:
df_os = df_os.loc[df_os['year'].isin(years)].copy(); print(df_os.shape)  # .copy() avoids SettingWithCopyWarning when columns are added later

(130, 49)


# 2. Calculate the report delay

In [42]:
df_os['delay_days'] = (df_os['date_ag_received']-df_os['date_incident']).dt.days

In [43]:
(df_os['delay_days']<0).sum()

0

# 3. Save the preprocessed data

In [44]:
df_os.to_pickle('../Data/Preprocessed/officer_shooting_preprocessed.pkl')
df_os.to_csv('../Data/Preprocessed/officer_shooting_preprocessed.csv')