In [1]:
import pandas as pd
pd.set_option('display.precision', 2)  # full option name; the short 'precision' alias is not accepted by newer pandas
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

plt.style.use('ggplot')
%config InlineBackend.figure_format = 'retina'
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
colors_ = plt.get_cmap('Set2')(np.linspace(0, 1, 8))

from IPython.core.pylabtools import figsize
from IPython.display import display
figsize(8, 5)

%load_ext watermark
%load_ext autoreload
%autoreload 2
%matplotlib inline

%watermark -d -t -u -v -g -r -b -iv -a "Hongsup Shin"

Author: Hongsup Shin

Last updated: 2021-05-16 11:03:32

Python implementation: CPython
Python version       : 3.7.10
IPython version      : 7.20.0

Git hash: 55299f866b979f9394f5ca724e351c738f0a6c6e

Git repo: https://github.com/texas-justice-initiative/officer_involved_shooting.git

Git branch: create_annual_report

numpy     : 1.20.2
pandas    : 1.2.3
sys       : 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 15:59:12) 
[Clang 11.0.1 ]
matplotlib: 3.4.1
seaborn   : 0.11.1



In [2]:
import preprocess

# Checking the discrepancy between the TJI and OAG datasets
## Motivation
For 2019 and 2020, the TJI data has about 20 fewer reports per year than the counts given in the OAG annual reports.

## Methods
OAG has provided OIS numbers (unique identifiers) for the reports behind its annual reports. This data is currently stored in TJI's Google Sheet (not on the TJI website). TJI has also updated the OIS data on its website by adding a column for OIS numbers. We will compare the two sets of OIS numbers to identify which reports are missing.

## TJI Data
Downloaded on May 16, 2021

In [3]:
df_c = pd.read_csv('../Data/Raw/Website/tji_civilians-shot_May2021.csv')
df_o = pd.read_csv('../Data/Raw/Website/tji_officers-shot_May2021.csv')

In [4]:
df_c.shape, df_o.shape

((1000, 144), (183, 48))

In [5]:
df_c = preprocess.convert_date_cols(df_c)
df_o = preprocess.convert_date_cols(df_o)
df_c['year'] = df_c['date_incident'].dt.year
df_o['year'] = df_o['date_incident'].dt.year

## OIS report number
`ois_report_no` is populated only for the more recent records in the TJI data.

### Civilian data

In [6]:
notnans = df_c['ois_report_no'].notna().groupby(df_c['year']).sum()
counts = df_c['date_incident'].groupby(df_c['year']).count()
pd.DataFrame([counts, notnans], index=['Total no. reports', 'OIS no. found']).T

year,Total no. reports,OIS no. found
2015,66,0
2016,176,0
2017,163,0
2018,175,1
2019,186,2
2020,178,155
2021,56,56


### Officer data

In [7]:
notnans = df_o['ois_report_no'].notna().groupby(df_o['year']).sum()
counts = df_o['date_incident'].groupby(df_o['year']).count()
pd.DataFrame([counts, notnans], index=['Total no. reports', 'OIS no. found']).T

year,Total no. reports,OIS no. found
2015,5,0
2016,37,0
2017,26,0
2018,26,0
2019,43,2
2020,35,33
2021,11,11
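
The two tables above can also be built in a single step with pandas named aggregation, which avoids the intermediate `notnans` and `counts` objects. The sketch below is only an illustration: `toy` and `ois_coverage` are hypothetical names, and the values are made up rather than taken from the TJI data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_c / df_o; the values are made up for illustration.
toy = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020, 2020],
    'ois_report_no': [np.nan, 'A-1', 'B-1', np.nan, 'B-2'],
})

def ois_coverage(df):
    """Count all rows and rows with an OIS number, per year."""
    summary = df.groupby('year')['ois_report_no'].agg(
        total_reports='size',  # all rows in the year, including missing OIS numbers
        ois_found='count',     # rows with a non-null OIS number
    )
    summary['share_with_ois'] = summary['ois_found'] / summary['total_reports']
    return summary

coverage = ois_coverage(toy)
print(coverage)
```

Note that `'size'` counts every row in a year, whereas the cells above count non-null `date_incident` values; the two agree only if no incident date is missing within a year.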


## OAG data

OIS report numbers given by OAG: these reports were used to create OAG's annual reports (2019, 2020). These Excel sheets have two types of columns, total reports and withdrawn reports; **we will exclude the withdrawn reports from the totals.**

In [8]:
OAG_2019 = pd.read_excel('../Data/Raw/GoogleDrive/PIC_No._R008943_-_2019_OIS_Annual_Report.xlsx')
OAG_2020 = pd.read_excel('../Data/Raw/GoogleDrive/PIC_NO._R008943_-_2020_Annual_Report.xlsx')

In [9]:
def get_ois_no_oag(df_OAG, victim_type='civilian'):
    """Return the OIS numbers in the OAG annual report by using the OAG Excel data.

    Withdrawn reports are excluded from the returned set.
    """
    if victim_type == 'civilian':
        col_reports = 'OIS Reports'
        col_withdrawn = 'Withdrawn OIS Reports'
    elif victim_type == 'officer':
        col_reports = 'POI Reports'
        col_withdrawn = 'Withdrawn POI Reports'
    else:
        raise ValueError("victim_type must be 'civilian' or 'officer'")
    # Drop missing values, then remove withdrawn reports from the full list
    result = set(df_OAG[col_reports].dropna()) - set(df_OAG[col_withdrawn].dropna())
    print(len(result), 'records found.')
    return result

In [10]:
OAG_2019_c = get_ois_no_oag(OAG_2019, 'civilian')
OAG_2019_o = get_ois_no_oag(OAG_2019, 'officer')
OAG_2020_c = get_ois_no_oag(OAG_2020, 'civilian')
OAG_2020_o = get_ois_no_oag(OAG_2020, 'officer')

198 records found.
40 records found.
194 records found.
35 records found.


## Comparison between the OAG and TJI data

The TJI data has only 2 OIS numbers for 2019, in both the civilian and the officer data, which is too few for a meaningful comparison. Thus, **we will only focus on 2020.**

In [11]:
def get_ois_no_tji(df_tji, year):
    """Return the OIS numbers in the TJI data from a given year"""
    result = set(df_tji[df_tji['year'] == year]['ois_report_no'].dropna())
    print(len(result), 'records found.')
    return result

In [12]:
TJI_2019_c = get_ois_no_tji(df_c, 2019)
TJI_2019_o = get_ois_no_tji(df_o, 2019)
TJI_2020_c = get_ois_no_tji(df_c, 2020)
TJI_2020_o = get_ois_no_tji(df_o, 2020)

2 records found.
2 records found.
155 records found.
33 records found.


### 1. Missing in TJI but existing in OAG

In [13]:
print('2020, civilian: {} records'.format(len(OAG_2020_c - TJI_2020_c)))
print('2020, officer: {} records'.format(len(OAG_2020_o - TJI_2020_o)))

2020, civilian: 39 records
2020, officer: 2 records


**This is incompatible with what we found from the OAG report.** The OAG data for the 2020 annual report contains 194 reports, and the TJI data (as of May 2021) has 178 reports for 2020, so simple counting says only 16 reports are missing. However, when we compared the OIS numbers, we found 39 OAG records missing from the TJI data. This suggests there are errors in either
1. the OIS numbers that OAG sent us
2. the OIS numbers that TJI added to our dataset
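
Before attributing the gap to wrong OIS numbers, it may help to redo the arithmetic with the counts printed earlier in this notebook. The sketch below uses only those counts. It shows that the number of 2020 TJI civilian records without an OIS number equals the difference between the two gaps. Whether those records are in fact the incidents behind the extra unmatched OAG numbers is an assumption that would need to be checked record by record.

```python
# Counts copied from the outputs earlier in this notebook (2020, civilians).
oag_total = 194      # OAG OIS numbers after removing withdrawn reports
tji_total = 178      # all TJI records with a 2020 incident date
tji_with_ois = 155   # TJI 2020 records that carry an OIS number
gap_by_ois = 39      # OAG OIS numbers not found in the TJI data

gap_by_count = oag_total - tji_total         # gap from simple report counts
tji_without_ois = tji_total - tji_with_ois   # TJI records that cannot be matched

print(gap_by_count, tji_without_ois)              # 16 23
print(gap_by_ois - gap_by_count == tji_without_ois)  # True
```

If the 23 TJI records without an OIS number correspond to OAG reports, the two gaps would be consistent with each other. This remains a hypothesis until those records are matched by hand.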

### 2. Missing in OAG but existing in TJI

In [14]:
print('2020, civilian: {} records'.format(len(TJI_2020_c - OAG_2020_c)))
print('2020, officer: {} records'.format(len(TJI_2020_o - OAG_2020_o)))

2020, civilian: 0 records
2020, officer: 0 records


For 2020, every OIS number present in the TJI data also appears in the OAG data. TJI records without an OIS number are not covered by this check.
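
Both directions of the comparison can also be done in one pass with an outer merge and `indicator=True`, which labels each OIS number as present in the left table only, the right table only, or both. This is a sketch on made-up identifiers: the frames `oag` and `tji` and their values are hypothetical stand-ins for the sets returned by `get_ois_no_oag` and `get_ois_no_tji`.

```python
import pandas as pd

# Hypothetical OIS numbers, for illustration only.
oag = pd.DataFrame({'ois_report_no': ['A-1', 'A-2', 'A-3']})
tji = pd.DataFrame({'ois_report_no': ['A-2', 'A-3', 'A-4']})

# The outer merge keeps every OIS number; the _merge column records where it came from.
merged = oag.merge(tji, on='ois_report_no', how='outer', indicator=True)

only_oag = merged.loc[merged['_merge'] == 'left_only', 'ois_report_no'].tolist()
only_tji = merged.loc[merged['_merge'] == 'right_only', 'ois_report_no'].tolist()

print('In OAG but not TJI:', only_oag)
print('In TJI but not OAG:', only_tji)
```

Unlike the set differences above, the merged frame keeps the identifiers themselves, so the unmatched reports can be listed and reviewed rather than only counted.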

## Conclusions
- OIS numbers are still very sparse in the TJI data before 2020, so we can't make a comparison for 2019.
- For the 2020 data, 39 civilian records and 2 officer records that appear in the OAG data are missing from the TJI data.
- This 2020 comparison is problematic. Based on the OIS number comparison (this notebook), our 2020 civilian data should have 39 fewer reports than OAG, but simply counting the reports in the OAG data and in our data gives a gap of only 16. This requires further investigation.
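
One possible starting point for that investigation is to pull the TJI records from 2020 that have no OIS number, since they cannot take part in the OIS-number comparison at all. The sketch below is a minimal illustration on made-up rows: `toy` and `records_without_ois` are hypothetical names, and only the column names `date_incident`, `ois_report_no`, and `year` come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_c; the values are made up for illustration.
toy = pd.DataFrame({
    'date_incident': pd.to_datetime(
        ['2020-01-05', '2020-07-22', '2020-03-10', '2019-11-02']),
    'ois_report_no': ['B-1', np.nan, np.nan, np.nan],
})
toy['year'] = toy['date_incident'].dt.year

def records_without_ois(df, year):
    """Return rows from the given year that have no OIS number, oldest first."""
    mask = (df['year'] == year) & df['ois_report_no'].isna()
    return df.loc[mask].sort_values('date_incident')

unmatched = records_without_ois(toy, 2020)
print(len(unmatched), 'records without an OIS number.')
```

On the real data this would be called as `records_without_ois(df_c, 2020)`, and the resulting rows could then be matched against the unmatched OAG reports by hand.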