# **Sample Quality Check - Durham Police Department Arrest Reports**
### Introduction
<font color=#FF0000>*TBD*</font><br><br>
This document contains sample code and instructions for evaluating the quality of data once it is in table format, based on factors such as accuracy, completeness, consistency, reliability, and whether it is up to date.
- **Quality metrics**: 
    - Completeness % (Counts & proportions of NAs)
        - Which NAs are relevant? Which should we try to impute or delete entirely?
    - Consistency (Value Counts, search for typos)
        - How to fix inconsistent categorical values?
    - Reliability (Perceived vs. self-reported, which values should be consistent?)
    - Currency (Dates, how old is too old?)
- **Summary statistics**:
    - Mean, min, max for continuous variables, crosstabs for discrete
    - Cross-comparison counts for discrete categorical variables 
- **Distributions**:
    - Histograms for continuous variables
    - Crosstabs, barplots for discrete categorical variables
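
The completeness and consistency checks listed above can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not the notebook's own method: `completeness_report` and `normalized_value_counts` are hypothetical helper names, and the small `demo` frame holds made-up values that stand in for the real table.

```python
import pandas as pd


def completeness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a percentage."""
    na_count = df.isna().sum()
    report = pd.DataFrame({
        "na_count": na_count,
        "na_pct": (na_count / len(df) * 100).round(1),
    })
    # Columns with the most missing values come first.
    return report.sort_values("na_count", ascending=False)


def normalized_value_counts(series: pd.Series) -> pd.Series:
    """Count categorical values after trimming whitespace and upper-casing,
    so that variants such as 'w ' and 'W' fall into one bucket."""
    cleaned = series.astype("string").str.strip().str.upper()
    return cleaned.value_counts(dropna=False)


# Toy data standing in for the real table (values are made up).
demo = pd.DataFrame({
    "race": ["W", "w ", "B", None],
    "age": [34, None, 27, None],
})

print(completeness_report(demo))
print(normalized_value_counts(demo["race"]))
```

Comparing the raw `value_counts()` with the normalized counts is one quick way to spot categories that differ only by case or stray whitespace. Whether to impute or drop the missing values is still a judgment call that depends on the column.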

### **Step #1:** Load the data

In [32]:
# Load the data in table format:
# https://www.practicaldatascience.org/html/pandas_series.html offers a quick tutorial on how to use the Pandas library if you are not familiar.
# The most common table data format is csv (comma separated values).
# Other common functions you may use to load the data are: pd.read_excel, pd.read_stata.


import ipywidgets as widgets
import io
import os
import pandas as pd
from IPython.display import Javascript, display

pd.set_option("display.max_columns", 500)
pd.set_option("display.max_rows", 500)

data_loaded=False


Select your datafile from the dropdown, and then press load.
You may need to upload your datafile to the "data" directory on the left.
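
If your file is not a CSV, the reader has to match the format. The sketch below picks a pandas reader from the file extension. It is an illustration only: `load_table` and `READERS` are hypothetical names that are not part of this notebook, and the mapping covers only the three readers mentioned in the comments above.

```python
import os
import tempfile

import pandas as pd

# Map file extensions to pandas readers. Only the formats named in this
# notebook are listed; extend the mapping as needed.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".dta": pd.read_stata,
}


def load_table(path: str, **kwargs) -> pd.DataFrame:
    """Load a table, choosing the pandas reader from the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in READERS:
        raise ValueError(f"Unsupported file type: {ext!r}")
    return READERS[ext](path, **kwargs)


# Round-trip a tiny made-up CSV to show the call.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    pd.DataFrame({"arrestnumber": [1, 2]}).to_csv(sample_path)
    sample = load_table(sample_path, index_col=[0])

print(sample)
```

Extra keyword arguments are passed straight to the underlying reader, so `index_col=[0]` behaves as it does in the `pd.read_csv` call used by the load button.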

In [36]:
outputWidget = widgets.Output(layout={'border': '1px solid black'})

fileSelect = widgets.Dropdown(
    options= os.listdir('data/'),
    disabled=False
)
display(fileSelect)
loadButton = widgets.Button(
    description='Load data file',
)
def load_data():
    global arrests
    arrests = pd.read_csv(os.path.join("data/",fileSelect.value),index_col=[0])
    global data_loaded
    data_loaded=True
    display(Javascript('IPython.notebook.execute_cells_below()'))
    with outputWidget:
        display("Successfully loaded the datafile " + fileSelect.value )

loadButton.on_click(lambda b: load_data())
display(loadButton,outputWidget)

Dropdown(options=('arrests_charges.csv',), value='arrests_charges.csv')


Let's take a quick look at the data.  Select how many rows you would like to preview, then press the 'Preview' button.

In [13]:
outputWidget2 = widgets.Output(layout={'border': '1px solid black'})
# Take a first look:
pd.set_option("display.max_rows", None)
numRows = widgets.Dropdown(
    options=['5', '10', '15', '20'],
    value='5',
    description='Number:',
    disabled=False,
)
display(numRows)
previewButton = widgets.Button(
    description='Preview',
)
def preview_data():
    outputWidget2.clear_output()
    with outputWidget2:
        display(arrests.sample(int(numRows.value)))

previewButton.on_click(lambda b: preview_data())
display(previewButton,outputWidget2)

Dropdown(description='Number:', options=('5', '10', '15', '20'), value='5')


### **Step #2:** Which type of data do we have?
Typically, police records document interactions between the police and a civilian. The first step in measuring the quality of your dataset is finding out which type of data you have. 
As you come to understand the types of fields you have, define the unit of observation: think about what one row in your table represents.<br>
In this example, one row is one police charge. However, there is a hierarchy: all charges against the same person in one police interaction fall under the same "case", which is identified by an "arrest number". Keep this hierarchy in mind, because it is important for understanding how police interact with individual people. 
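
One way to confirm the unit of observation is to count rows per arrest number. The sketch below uses a small made-up frame. The column names `arrestnumber` and `charge_type` follow the ones that appear in this dataset, but the values are invented for illustration.

```python
import pandas as pd

# Made-up rows: one row per charge; several charges can share an arrest number.
charges = pd.DataFrame({
    "arrestnumber": [101, 101, 102, 103, 103, 103],
    "charge_type": ["Misd", "Fel", "Misd", "Misd", "Misd", "Fel"],
})

# Rows per arrest: values above 1 confirm the charge-within-arrest hierarchy.
charges_per_arrest = charges.groupby("arrestnumber").size()

n_rows = len(charges)                          # number of charges
n_arrests = charges["arrestnumber"].nunique()  # number of arrests (cases)

print(charges_per_arrest)
print(f"{n_rows} charges across {n_arrests} arrests")
```

If the number of rows equals the number of unique arrest numbers, each row is an arrest. If rows outnumber arrests, as here, each row is a charge and person-level analysis should group by arrest first.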

In [21]:
outputWidget3 = widgets.Output(layout={'border': '1px solid black'})
layout = widgets.Layout(width='auto', height='40px') #set width and height
columnsButton = widgets.Button(
    description='Display column names',
    layout=layout,
)
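def displayColumns():
    # Clear earlier output so repeated clicks do not stack copies of the list.
    outputWidget3.clear_output()
    with outputWidget3:
        display(arrests.columns)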

columnsButton.on_click(lambda b: displayColumns())
display(columnsButton,outputWidget3)




Which of these columns represents a demographic category of interest, like race?

In [23]:
if data_loaded:
    # Let the user pick the column that holds the demographic category (e.g. race).
    raceDropDown = widgets.Dropdown(
        options=arrests.columns,
        disabled=False,
    )
    display(raceDropDown)


Dropdown(options=('agencyname', 'datetimeofarrest', 'file', 'arrestnumber', 'scars_tattoes_bodymarkings_etc', …

Which of these columns represents a numerical measure of police interactions?

In [24]:
if data_loaded:
    numericalDropdown = widgets.Dropdown(
        options=arrests.columns,
        disabled=False,
    )
    display(numericalDropdown)

Dropdown(options=('agencyname', 'datetimeofarrest', 'file', 'arrestnumber', 'scars_tattoes_bodymarkings_etc', …
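
To narrow the choice, it can help to list which columns pandas treats as numeric. The sketch below runs on a made-up frame: the column names echo ones from this dataset, while the values (including the agency label) are invented.

```python
import pandas as pd

# Made-up frame; column names echo the ones in this dataset.
demo = pd.DataFrame({
    "agencyname": ["DPD", "DPD"],
    "arrestnumber": [101, 102],
    "charge_counts": [1, 3],
})

# Split the columns by whether their dtype is numeric.
numeric_columns = demo.select_dtypes(include="number").columns.tolist()
other_columns = demo.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)
print(other_columns)
```

A numeric dtype does not guarantee a meaningful measure: identifiers such as `arrestnumber` are stored as numbers, but averaging them tells you nothing.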

In [28]:
if data_loaded:
    display(arrests.groupby([raceDropDown.value])[numericalDropdown.value].mean())

race
A    1.090909
B    1.081053
I    1.450980
U    1.018868
W    1.057388
Name: charge_counts, dtype: float64
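
A group mean is easier to interpret next to the group size and range. The sketch below, on made-up rows, computes count, mean, min and max in one pass. The column names `race` and `charge_counts` match the output above, but the numbers are invented.

```python
import pandas as pd

# Made-up rows standing in for the real table.
demo = pd.DataFrame({
    "race": ["A", "B", "B", "W", "W", "W"],
    "charge_counts": [1, 1, 2, 1, 1, 4],
})

# One row per group, one column per statistic.
summary = demo.groupby("race")["charge_counts"].agg(["count", "mean", "min", "max"])
print(summary)
```

A high mean in a group with few rows may be driven by one or two records, so it is worth reading the `count` column before comparing means across groups.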

In [None]:
#pd.crosstab(index=arrests['charge_type'], columns=arrests['sex'])

sex             F      M
charge_type
Fel          1948   6861
Misd         4260  14457
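
Raw counts are hard to compare when the groups differ in size, so a crosstab of proportions can help. The sketch below uses made-up rows. The column names `charge_type` and `sex` come from the crosstab above, and `normalize="columns"` is one possible choice, not the only one.

```python
import pandas as pd

# Made-up rows standing in for the real table.
demo = pd.DataFrame({
    "charge_type": ["Fel", "Fel", "Misd", "Misd", "Misd", "Misd"],
    "sex": ["F", "M", "F", "M", "M", "M"],
})

# Plain counts, as in the crosstab above.
counts = pd.crosstab(index=demo["charge_type"], columns=demo["sex"])

# normalize="columns" divides each cell by its column total,
# giving the share of each charge type within each sex.
shares = pd.crosstab(
    index=demo["charge_type"], columns=demo["sex"], normalize="columns"
)

print(counts)
print(shares)
```

Use `normalize="index"` instead to get the share of each sex within each charge type.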
