# US Police Violence & Racial Equity
**Data from a variety of sources to support analysis promoting fair treatment**

The aim of this notebook is to use a few automated tools to speed up exploratory data analysis.
We have plenty of different datasets, and I will take a first look at what we have here.

In [None]:
import pandas as pd

pd.set_option('display.max_columns', 100) # show up to 100 columns
pd.set_option('display.max_rows', 10) # show up to 10 rows before truncating
pd.set_option('display.width', 1000) # display width of 1000 characters

#plotting library
import matplotlib.pyplot as plt
import seaborn as sns             

# interactive plotting library
import plotly.express as px       
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import iplot
from plotly.subplots import make_subplots

import pandas_profiling # library for automatic EDA
%pip install autoviz # installing and importing autoviz, another library for automatic data visualization
from autoviz.AutoViz_Class import AutoViz_Class

from IPython.display import HTML, display # rich output helpers

import os

In [None]:
from scipy import stats # statistical library
from statsmodels.stats.weightstats import ztest # statistical library for hypothesis testing
from itertools import cycle # function used for cycling over values

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
print("")
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Section 0 - Getting the data

In [None]:
home = '../input/police-violence-in-the-us'
try:
    deaths_arrests_race = pd.read_csv(os.path.join(home, 'deaths_arrests_race.csv'))
    dod_equipment_purchases = pd.read_csv(os.path.join(home, 'dod_equipment_purchases.csv'))
    fatal_encounters_dot_org = pd.read_csv(os.path.join(home, 'fatal_encounters_dot_org.csv'))
    po_contracts = pd.read_csv(os.path.join(home, 'police_contracts.csv'))
    po_deaths_538 = pd.read_csv(os.path.join(home, 'police_deaths_538.csv'))
    po_employment_fbi = pd.read_csv(os.path.join(home, 'police_employment_fbi.csv'))
    po_killings = pd.read_csv(os.path.join(home, 'police_killings.csv'))
    po_policies = pd.read_csv(os.path.join(home, 'police_policies.csv'))
    shootings_wash_post = pd.read_csv(os.path.join(home, 'shootings_wash_post.csv'))
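except FileNotFoundError as err:
    # Fail loudly: the cells below need every DataFrame to exist
    print('File names have changed! Missing: %s' % err.filename)
    raise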
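The loading step above can also be written as a loop that returns a dictionary of DataFrames and reports which files are missing. This is only a minimal sketch: `load_csv_folder` and the `present`/`absent` file names are hypothetical, and the demo writes to a temporary directory instead of the Kaggle input folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_folder(folder, stems):
    """Read <stem>.csv for every stem; return (loaded frames, missing stems)."""
    loaded, missing = {}, []
    for stem in stems:
        path = Path(folder) / f"{stem}.csv"
        if path.is_file():
            loaded[stem] = pd.read_csv(path)
        else:
            missing.append(stem)  # remember the name instead of failing silently
    return loaded, missing


# Demo on a throwaway directory so the sketch runs anywhere (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"x": [1, 2]}).to_csv(Path(tmp) / "present.csv", index=False)
    frames, missing = load_csv_folder(tmp, ["present", "absent"])

print("loaded:", list(frames))
print("missing:", missing)
```

With this shape, the dictionary of datasets comes for free and a renamed file shows up by name in `missing`.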

In [None]:
datasets = {"deaths_arrests_race": deaths_arrests_race,
            "dod_equipment_purchases" : dod_equipment_purchases,
            "fatal_encounters_dot_org": fatal_encounters_dot_org,
            "po_contracts": po_contracts,
            "po_deaths_538": po_deaths_538,
            "po_employment_fbi": po_employment_fbi,
            "po_killings": po_killings,
            "po_policies": po_policies,
            "shootings_wash_post": shootings_wash_post}

# Section 1 - Data Exploration

## First look at all datasets

In [None]:
keys_datasets = list(datasets)  # dataset names, in insertion order

print(keys_datasets)

In [None]:
for key, value in datasets.items():
    display("Dataset name: %s" % key)
    value.info()  # info() prints its summary and returns None, so call it on its own
    display(value.head(5),
            value.shape,
            value.describe(include = "all"),
            value.columns,
            #value.value_counts(),
            value.nunique())
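The per-dataset output above gets long with nine datasets, so a one-row-per-dataset overview can help before diving in. The following is a minimal sketch: `summarize` and the toy frames are hypothetical stand-ins, and in the notebook the function would be called on the `datasets` dictionary.

```python
import pandas as pd


def summarize(frames):
    """Build one overview row (rows, columns, missing cells) per DataFrame."""
    rows = []
    for name, frame in frames.items():
        rows.append({
            "dataset": name,
            "rows": frame.shape[0],
            "columns": frame.shape[1],
            "missing_cells": int(frame.isna().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("dataset")


# Toy stand-ins for the real datasets (assumed data, for illustration only).
toy = {
    "a": pd.DataFrame({"x": [1, None, 3], "y": ["u", "v", None]}),
    "b": pd.DataFrame({"z": [1.0, 2.0]}),
}
overview = summarize(toy)
print(overview)
```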
    

## Automated Preprocessing with dabl
As part of the preprocessing, dabl will attempt to identify missing values, feature types and erroneous data. If the detection of semantic types (continuous, categorical, ordinal, text, etc.) fails, the user can provide `type_hints`. Let's demo the library on the Washington Post shootings dataset (`shootings_wash_post`).

In [None]:
# Installing and loading the library
!pip install dabl

import dabl

In [None]:
shootings_wash_post_clean = dabl.clean(shootings_wash_post, verbose=1)

types = dabl.detect_types(shootings_wash_post)
print(types) 
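To get a feel for what "detecting semantic types" means, here is a rough pandas-only approximation. It is not dabl's algorithm: the `rough_types` helper, the `max_categories` threshold, and the toy columns are all assumptions made for illustration.

```python
import pandas as pd


def rough_types(frame, max_categories=10):
    """Guess a coarse semantic type per column from its dtype and cardinality."""
    kinds = {}
    for column in frame.columns:
        series = frame[column].dropna()
        n_unique = series.nunique()
        if n_unique <= 1:
            kinds[column] = "useless"        # constant columns carry no signal
        elif pd.api.types.is_bool_dtype(series) or n_unique <= max_categories:
            kinds[column] = "categorical"    # few distinct values
        elif pd.api.types.is_numeric_dtype(series):
            kinds[column] = "continuous"     # many distinct numeric values
        else:
            kinds[column] = "free_string"    # many distinct non-numeric values
    return kinds


# Toy frame with assumed columns, for illustration only.
toy = pd.DataFrame({
    "age": list(range(20, 40)),
    "gender": ["M", "F"] * 10,
    "name": [f"person_{i}" for i in range(20)],
    "source": ["wapo"] * 20,
})
kinds = rough_types(toy)
print(kinds)
```

A heuristic like this misfires easily (for example on integer-coded categories with many levels), which is the situation where passing explicit type hints becomes useful.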

## Exploratory Data analysis with dabl
dabl provides a high-level interface that summarizes several common plots. For low-dimensional datasets, all features are shown; for high-dimensional datasets, only the most informative features for the given task are shown.

In [None]:
dabl.plot(shootings_wash_post, target_col="manner_of_death")

Next, let's take a closer look at the dataset.

To save time in the exploratory data analysis, we are going to use two libraries: `pandas_profiling` and `autoviz`.

## Exploratory Data analysis with pandas profiling
**pandas_profiling**

The pandas profiling library is really useful for understanding the data we're working with.
It saves us time in the EDA process.

In [None]:
report = pandas_profiling.ProfileReport(shootings_wash_post)

In [None]:
# Let's now visualize the report generated by pandas_profiling.
display(report)

# Also, there is an option to generate an .HTML file containing all the information generated by the report.
# report.to_file(output_file='report.html')

In [None]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame(
    np.random.rand(100, 5),
    columns=["a", "b", "c", "d", "e"]
)
# To generate a standard report, run:
profile = ProfileReport(df, title="Pandas Profiling Report")

# explorative=True requests a deeper analysis; this reassignment replaces the report above
profile = ProfileReport(df, title='Pandas Profiling Report', explorative=True)

## Exploratory Data analysis with AutoViz
**AutoViz**



In [None]:
''' Another great library for automatic EDA is AutoViz.
With this library, several plots are generated with only 1 line of code.
When combined with pandas_profiling, we obtain lots of information in a
matter of seconds, using fewer than 5 lines of code. '''

AV = AutoViz_Class()

# Let's now visualize the plots generated by AutoViz.
report_2 = AV.AutoViz(os.path.join(home, 'shootings_wash_post.csv'))

In [None]:
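# NOTE: df_survivors and df_nonsurvivors are not defined anywhere in this notebook;
# they must be created before this cell can run.

# First distribution for the hypothesis test: Ages of survivors
dist_a = df_survivors['Age'].dropna()

# Second distribution for the hypothesis test: Ages of non-survivors
dist_b = df_nonsurvivors['Age'].dropna()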

In [None]:
# Z-test: Checking if the distribution means (ages of survivors vs ages of non-survivors) are statistically different
z_stat, p_value = ztest(dist_a, dist_b)
print("----- Z Test Results -----")
print("Z stat. = " + str(z_stat))
print("P value = " + str(p_value)) # P-value is less than 0.05

print("")

# T-test: Checking if the distribution means (ages of survivors vs ages of non-survivors) are statistically different
t_stat_2, p_value_2 = stats.ttest_ind(dist_a, dist_b)
print("----- T Test Results -----")
print("T stat. = " + str(t_stat_2))
print("P value = " + str(p_value_2)) # P-value is less than 0.05
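To see what the two tests above actually compute, here is a standard-library sketch of a pooled-variance two-sample statistic with a normal (z-test) p-value. It assumes the library defaults, which as far as I know are pooled variance for `ztest` and equal variances for `ttest_ind`; under those defaults both tests share the same statistic and differ only in the reference distribution used for the p-value. The helper names and the synthetic samples are hypothetical.

```python
import math
import random
import statistics


def pooled_two_sample_stat(a, b):
    """(mean_a - mean_b) divided by the pooled-variance standard error."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err


def normal_two_sided_p(stat):
    """Two-sided p-value of a statistic under the standard normal distribution."""
    return math.erfc(abs(stat) / math.sqrt(2))


# Synthetic samples (assumed data, not taken from any dataset in this notebook).
rng = random.Random(0)
group_a = [rng.gauss(30, 10) for _ in range(200)]
group_b = [rng.gauss(34, 10) for _ in range(200)]

stat = pooled_two_sample_stat(group_a, group_b)
p_value = normal_two_sided_p(stat)
print("stat =", stat)
print("p (normal reference) =", p_value)
```

With samples of a few hundred points the normal and Student-t reference distributions are nearly identical, so the two p-values should be very close.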

## Seaborn

Looking at the plots and hypothesis tests over the fare distributions for survivors and non-survivors, we again observe a statistically significant difference between the means of the two groups.

In the boxplots, fare values of survivors are generally higher than those of non-survivors. This is probably related to the "Pclass" percentages we saw earlier in the pie plots.