In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df=pd.read_csv('../input/individuals-killed-by-the-police/Police Fatalities.csv',encoding='ISO-8859-1')

In [None]:
df.head()

In [None]:
df.shape

First, as part of cleaning the data, I check for duplicated records using the record identifier (UID), which should be unique.

In [None]:
len(df.UID.unique())

In [None]:
df['UID'].isna().sum()

There are 12488 unique UIDs, which is fewer than the total number of rows, so let's find the 3 repeated values:

In [None]:
df.UID.value_counts().to_frame()

Above we see 3 pairs of records sharing the same UID, which should not be possible. The cell below shows them in more detail. Rather than dropping these records, let's give a new UID to one record of each pair.

In [None]:
df.loc[(df['UID']==13136) | (df['UID']==13139) | (df['UID']==13130)]

In [None]:
df['UID'].max()

Next, assign new numbers starting at 15000 to these 3 records:

In [None]:
df.iloc[12117, 0] = 15000
df.iloc[7526, 0] = 15001
df.iloc[12118, 0] = 15002

In [None]:
df.iloc[[12117,7526,12118],:]

In [None]:
len(df.UID.unique())
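
The hard-coded row positions above only work for this exact file. A minimal sketch of the same idea without fixed positions, shown on a toy frame (the helper name `reassign_duplicate_uids` and the toy data are assumptions for illustration, not part of the original notebook):

```python
import pandas as pd

def reassign_duplicate_uids(frame, uid_col="UID", start=None):
    """Return a copy where the second and later occurrences of a UID get fresh ids."""
    out = frame.copy()
    # True for every repeated occurrence, False for the first one
    dup_mask = out[uid_col].duplicated(keep="first")
    if start is None:
        # Default: continue right after the current maximum UID
        start = int(out[uid_col].max()) + 1
    out.loc[dup_mask, uid_col] = list(range(start, start + int(dup_mask.sum())))
    return out

# Toy data: UIDs 2 and 3 each appear twice
toy = pd.DataFrame({"UID": [1, 2, 2, 3, 3], "Name": list("abcde")})
fixed = reassign_duplicate_uids(toy)
print(fixed)
```

In the notebook this could be called as `reassign_duplicate_uids(df, start=15000)` to mirror the numbering chosen above.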

Now every record in the dataset has a unique UID. The next step is to convert the Date column to a timestamp type, and then to deal with inconsistent and missing values in the other features.

In [None]:
df['Date'] = pd.to_datetime(df['Date'])  # infer_datetime_format is deprecated in recent pandas versions
df.info()

In [None]:
df.isna().sum()

Let's make some plots to understand the distribution of the 5 features that have null values. The dataset contains only one numerical feature, 'Age', so we start there and look for the best value to impute:

In [None]:
df.Age.describe()

In [None]:
sns.histplot(df['Age'], bins=25, kde=True, stat='density')  # distplot is deprecated in recent seaborn versions

In [None]:
df['Age'].skew(axis=0)

The skewness of the Age column is computed above. Since the distribution is roughly bell-shaped with only a moderate positive skew, we will impute the mean for the missing values:

In [None]:
sns.boxplot(x='Age', data=df)

In [None]:
df['Age'].mean()

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['Age'] = df['Age'].fillna(df['Age'].mean())

In [None]:
df['Age']=df['Age'].round(decimals=0)
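
To see how much the choice of fill value matters under a positive skew, here is a small sketch comparing mean and median imputation. The ages in `ages` are made up for illustration; the notebook itself uses the mean.

```python
import numpy as np
import pandas as pd

# Toy ages with two missing values and one large value pulling the mean up
ages = pd.Series([18, 22, 25, 31, 40, np.nan, 67, np.nan])

# Mean imputation, as used in this notebook, followed by rounding
mean_filled = ages.fillna(ages.mean()).round()

# Median imputation, a common alternative that is less sensitive to skew
median_filled = ages.fillna(ages.median()).round()

print(pd.DataFrame({"mean_filled": mean_filled, "median_filled": median_filled}))
```

The stronger the skew, the further apart the two fill values end up.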

In [None]:
df.isna().sum()

As we can see above, the missing values in the Age column were imputed with the mean. Now let's look at Gender:

In [None]:
df['Gender'].value_counts()

In [None]:
df[df['Gender'].isna()]

For some records we can read the person's name and impute the gender from it. Others have the name recorded as unknown (Name withheld by police); for these we impute the most frequent category, which is male.

In [None]:
df.iloc[1749, 3] = 'Female'
df.iloc[2915, 3] = 'Male'
df.iloc[7224, 3] = 'Female'
df.iloc[9793, 3] = 'Male'
df.iloc[9559, 3] = 'Male'
df.iloc[10564, 3] = 'Male'
df.iloc[11251, 3] = 'Male'
df.iloc[12052, 3] = 'Male'

In [None]:
df.isna().sum()

The following feature (Race) is one of the most important when analysing this kind of data.  
Because of the excessive amount of missing values, almost a third of the total, it is much harder to impute a value for each record, as no other feature gives us a hint. This will without a doubt affect the insights and conclusions.

In [None]:
df['Race'].value_counts()

In [None]:
df['Race'].value_counts(normalize=True)

In [None]:
sns.countplot(x='Race',data=df)

Above we see the distribution for race. As we have to impute values for almost one third of our data, we must find the most appropriate function, or even an ML model such as KNN, that avoids losing information. One of the most useful approaches is imputing by the ratios of the existing data; it is the one used here, so that the feature keeps the same proportions.

In [None]:
round(df['Race'].value_counts(normalize=True),ndigits=2)

In [None]:
df['Race'] = df['Race'].fillna(pd.Series(np.random.choice(['White', 'Black', 'Hispanic','Asian','Native','Other'],
                                                            p=[0.45, 0.29, 0.21, 0.02, 0.02, 0.01], size=len(df))))

In [None]:
df['Race'].value_counts(normalize=True)
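
The probabilities above were typed in by hand and the draw is not seeded, so each run gives a slightly different result. A minimal sketch of the same ratio-based imputation that reads the proportions from the data and fixes the seed (the helper name `impute_by_observed_ratio` and the toy series are assumptions for illustration):

```python
import numpy as np
import pandas as pd

def impute_by_observed_ratio(series, seed=0):
    """Fill NaNs by sampling categories with their observed relative frequencies."""
    rng = np.random.default_rng(seed)            # fixed seed for reproducibility
    probs = series.value_counts(normalize=True)  # NaNs are excluded by default
    missing = series.isna()
    draws = rng.choice(probs.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy())
    filled = series.copy()
    filled.loc[missing] = draws                  # only the missing positions change
    return filled

# Toy data with two missing values
toy = pd.Series(["White", "Black", None, "White", None, "Hispanic"])
filled = impute_by_observed_ratio(toy)
print(filled.value_counts(normalize=True))
```

Only the missing positions are drawn, so the observed values are never touched.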

In [None]:
df.isna().sum()

Now let's work with the missing values of City. As there are only four, I will impute the most frequent city of the state each record belongs to. For example, the first missing city belongs to the state of California (CA), where the city with the most events is Los Angeles, so that is the value imputed.

In [None]:
len(df['City'].unique())

In [None]:
df[df['City'].isna()]

In [None]:
print('Cities with most events in CA:')
pd.DataFrame(df[df['State']=='CA']).groupby(by='City').count().sort_values(by='UID', ascending=False).head()

In [None]:
df.iloc[4110,6]='Los Angeles'

In [None]:
print('City with most events in AL:')
pd.DataFrame(df[df['State']=='AL']).groupby(by='City').count().sort_values(by='UID', ascending=False).head(1).index.item()

In [None]:
df.iloc[9093,6]='Birmingham'

In [None]:
print('City with most events in MS:')
pd.DataFrame(df[df['State']=='MS']).groupby(by='City').count().sort_values(by='UID', ascending=False).head(1).index.item()

In [None]:
df.iloc[10355,6]='Jackson'

In [None]:
print('City with most events in GA:')
pd.DataFrame(df[df['State']=='GA']).groupby(by='City').count().sort_values(by='UID', ascending=False).head(1).index.item()

In [None]:
df.iloc[10549,6]='Atlanta'
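
The four lookups above can also be done in one pass. A minimal sketch on toy data, assuming only the `State` and `City` column names from this dataset (the rows themselves are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "AL", "AL", "CA", "AL"],
    "City": ["Los Angeles", "Los Angeles", "Fresno", "Birmingham", None, None, "Birmingham"],
})

# Most frequent known city for each state
top_city = (
    toy.dropna(subset=["City"])
       .groupby("State")["City"]
       .agg(lambda s: s.value_counts().idxmax())
)

# Fill each missing city with the top city of its own state
toy["City"] = toy["City"].fillna(toy["State"].map(top_city))
print(toy)
```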

Now there should be only one feature left with missing values:

In [None]:
df.isna().sum()

Comparing the Armed column with and without missing values shows a huge difference:

In [None]:
df['Armed'].value_counts(normalize=True)

In [None]:
df['Armed'].value_counts(dropna=False)

The data above clearly calls for a more advanced model to deal with the missing values in this feature. Missing values make up almost half of the available data, so they will weigh heavily on our conclusions, which is why this can be considered must-have information. 

Thinking a bit further: if we leave the missing values aside, 6% of the people killed were unarmed. Based on this, we reject the idea that the missing values correspond to people who were unarmed. 

**Applying a more advanced model to determine the missing values in the Armed column is still pending; the presence of a datetime feature makes this considerably more complex.** For this work we will impute the categories by their current ratio, as we did before.

Just to know how the people with missing values in the 'Armed' column were killed:

In [None]:
df[df['Armed'].isna()].groupby(by='Manner_of_death').count().iloc[:,0]

Above we see that 91% of these people were shot and 8% were tasered to death. Given how drastic these methods are, one could suppose that the people involved carried extremely dangerous weapons such as guns or knives, but this is just an early hypothesis.

Before building the model, let's merge duplicated categories and get some additional information about the categories in the 'Armed' column.

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['Armed'] = df['Armed'].replace(to_replace='Toy weapon', value='Toy Weapon')

In [None]:
df['Armed'].value_counts(normalize=True)

In [None]:
len(df['Armed'].unique())

As we have 59 categories in the 'Armed' column, it would be tedious to deal with every one of them, so let's reduce them to the most frequent ones until we cover a significant proportion of the total. For example, the first 10 categories in the list above cover 98.6% of all events, so let's use them, in their observed proportions, to impute the missing values.

In [None]:
df['Armed'] = df['Armed'].fillna(pd.Series(np.random.choice(
    ['Gun', 'Knife', 'Unarmed','Vehicle','Toy Weapon','Machete','Unknown Weapon','Sword','Box Cutter','Hammer'],
    p=[0.6995, 0.2024, 0.0595, 0.0164, 0.0145, 0.0023, 0.0022, 0.0011, 0.0010, 0.0011], size=len(df))))

In [None]:
df['Armed'].value_counts()
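
The coverage of the top categories and the renormalised probabilities can be computed rather than copied by hand. A minimal sketch on a toy series (the values and the cut-off `k = 2` are made up for illustration):

```python
import pandas as pd

# Toy 'Armed' column: 6 guns, 3 knives, 1 axe
armed = pd.Series(["Gun"] * 6 + ["Knife"] * 3 + ["Axe"])

shares = armed.value_counts(normalize=True)  # sorted from most to least frequent
k = 2                                        # number of top categories to keep
coverage = shares.head(k).sum()              # share of events covered by the top k

# Rescale the top-k shares so they sum to 1 and can be passed as p= to a random draw
probs = shares.head(k) / coverage

print(f"Top {k} categories cover {coverage:.1%} of events")
print(probs)
```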

In [None]:
df.isna().sum()

**Now that our data is cleaned, we can start analyzing it:**

In [None]:
df.sample(10)

Let's look at the distribution of ages in 25 bins:

In [None]:
sns.histplot(df['Age'], bins=25, kde=True, stat='density')  # It still has a moderate positive skew.

Comparing the number of events by gender, and the age distribution for each gender:

In [None]:
sns.countplot(x='Gender',data=df)

In [None]:
sns.boxplot(x='Gender',y='Age',data=df)

Plotting distribution for races:

In [None]:
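# Name the index and the column explicitly so the 'index'/'Race' columns used below exist in any pandas version
pie = df['Race'].value_counts().rename_axis('index').to_frame('Race')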

In [None]:
pie.reset_index(inplace=True)

In [None]:
pie.columns

In [None]:
pie

In [None]:
pie.plot(kind='pie', title='Pie chart for Races',y = 'Race', autopct='%1.1f%%', shadow=False, labels=pie['index'], legend = False, fontsize=14, figsize=(12,12))

In [None]:
sns.boxplot(x='Race',y='Age',data=df)

Manner of death by race:

In [None]:
pd.crosstab(df['Manner_of_death'], df['Race'], rownames=['Manner_of_death'], colnames=['Race'])

In [None]:
sns.countplot(x='Manner_of_death',data=df)

Pie chart for weapons used:

In [None]:
df['Armed'].value_counts()

In [None]:
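armed_counts = df['Armed'].value_counts()
# Name the index and the column explicitly so 'index'/'Armed' exist in any pandas version
pie3 = armed_counts.rename_axis('index').to_frame('Armed').reset_index()
pie3 = pie3.head(6)
# Sum everything outside the top 6 instead of hard-coding the count
pie3.loc[6] = ['Others', armed_counts.iloc[6:].sum()]
pie3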

In [None]:
pie3.plot(kind='pie', title='Pie chart of weapons used',y = 'Armed', autopct='%1.1f%%', shadow=False, labels=pie3['index'], legend = False, fontsize=14, figsize=(12,12))

Let's see the manner of death by the weapon the people were carrying:

In [None]:
pd.crosstab(df['Armed'], df['Manner_of_death'], rownames=['Armed'], colnames=['Manner_of_death']).sort_values(by='Shot',ascending=False).head(10)

The male/female proportion stays roughly constant; below we see it broken down by the type of weapon used. 

In [None]:
pd.crosstab(df['Armed'], df['Gender'], rownames=['Armed'], colnames=['Gender']).sort_values(by='Male',ascending=False).head(10)

One of the most controversial policies in the US, with a large number of opponents, is that teenagers or even children can legally get easy access to guns. The following gives how many people under 21 years old who were using a gun were killed by police between 2000 and 2016: 

In [None]:
# Combine both conditions in a single boolean mask and count the matching rows
((df['Armed'] == 'Gun') & (df['Age'] < 21)).sum()

The age distribution for people using guns leans toward young ages, peaking at around 22 years old, which suggests the curve could become even more right-skewed in the following years.

In [None]:
sns.histplot(df[df['Armed']=='Gun']['Age'], bins=40, kde=True, stat='density')

Distribution of fatalities in each state by race:

In [None]:
pd.crosstab(df['State'], df['Race'], rownames=['State'], colnames=['Race']).head(10)

States where more Black people than white people were killed by police:

In [None]:
df5=pd.crosstab(df['State'], df['Race'], rownames=['State'], colnames=['Race'])
df5.loc[(df5['Black']>df5['White'])]

States where more Hispanic people than white people were killed by police:

In [None]:
df6=pd.crosstab(df['State'], df['Race'], rownames=['State'], colnames=['Race'])
df6.loc[(df6['Hispanic']>df6['White'])]
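
The two comparisons above can be generalised: the predominant race in each state is simply the column with the largest count in each row of the crosstab. A minimal sketch on toy data (the rows are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "LA", "LA", "LA", "TX", "TX"],
    "Race":  ["Hispanic", "Hispanic", "White", "Black", "Black", "White", "White", "White"],
})

table = pd.crosstab(toy["State"], toy["Race"])

# For each state (row), take the race (column) with the highest count
predominant = table.idxmax(axis=1)
print(predominant)
```

Note that `idxmax` returns the first column in case of a tie, so ties would need separate handling.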

**Let's plot a map chart indicating the predominant race of people killed by police in every state:**

In [None]:
from IPython.display import Image, display
display(Image(filename='../input/map-charts-george/Races.png')) 

Above we clearly see that white people are the predominant group among those killed by police in almost every state, whereas Black people are predominant in several eastern states, with maxima in Louisiana and New Jersey, and Hispanic people only in California.

In [None]:
df['Mental_illness'].value_counts(normalize=True)

21% of the people had a mental illness. Let's see the distribution by state:

In [None]:
df[df['Mental_illness']==True].groupby(by='State').count().sort_values(by='UID',ascending=False).iloc[:10,0]

Guns and knives remain the most common weapons for these people too, but other weapons also appear, such as sword, hammer, hatchet, axe and chain saw.

In [None]:
df[df['Mental_illness']==True].groupby(by='Armed').count().sort_values(by='UID',ascending=False).iloc[:20,0]

Age distribution for people with mental illness:

In [None]:
sns.histplot(df[df['Mental_illness']==True]['Age'], bins=25, kde=True, stat='density')

Approximately 4% of the people killed were fleeing:

In [None]:
df['Flee'].value_counts(normalize=True)

For those who were fleeing, the distribution by state shows some concentration in the southern border states:

In [None]:
df[df['Flee']==True].groupby(by='State').count().sort_values(by='UID',ascending=False).iloc[:15,0]

In [None]:
display(Image(filename='../input/map-charts-george/fleeing.png')) 

Time Series Analysis:  
First, let's split the Date feature so that events can be grouped by year, month and day.

In [None]:
df['year'] = pd.to_datetime(df['Date']).dt.year
df['month'] = pd.to_datetime(df['Date']).dt.month
df['day'] = pd.to_datetime(df['Date']).dt.day
df['Day of Week'] = pd.to_datetime(df['Date']).dt.day_name()
df.sample(10)

In [None]:
day_week=df.groupby(by='Day of Week').count()
day_week.iloc[:,0]

I will just create a new dataframe from the data above but with the days of the week sorted:

In [None]:
data = [1661,1786,1807,1840,1858,1806,1733]
dddf = pd.DataFrame(data, index =['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
dddf

In [None]:
dddf.iloc[:,0].plot(kind='bar')
plt.show()
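
Typing the counts in by hand is fragile, because they change whenever the data changes. A minimal sketch of ordering weekday counts with `reindex` instead (the dates are made up for illustration):

```python
import pandas as pd

# Toy dates: one Monday, two Tuesdays, one Sunday
dates = pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-05", "2016-01-10"])

order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Count events per weekday name, then put the rows in calendar order;
# weekdays with no events get a count of 0
counts = pd.Series(dates.day_name()).value_counts().reindex(order, fill_value=0)
print(counts)
```

In the notebook, `df['Day of Week'].value_counts().reindex(order)` would give the sorted counts directly.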

In [None]:
days_event=df.groupby(by='day').count()
days_event.iloc[:,0]

In [None]:
days_event.iloc[:,0].plot(kind='bar')
plt.show()

Above we have the distribution of events by day of the month.

In [None]:
months_event=df.groupby(by='month').count()
months_event.iloc[:,0]

In [None]:
months_event.iloc[:,0].plot(kind='bar')
plt.show()

Above we have the distribution of events by month.

In [None]:
years_event=df.groupby(by='year').count()
years_event.iloc[:,0]

In [None]:
years_event.iloc[:,0].plot(kind='bar')
plt.show()

Above we have the distribution of events by year.

What about a stacked bar chart for races by year?
Counting events per race:

In [None]:
df.Race.value_counts()

In [None]:
df11=pd.crosstab(df['year'], df['Race'], rownames=['year'], colnames=['Race'])
df11

In [None]:
ax = df11.plot(kind='bar', stacked=True, figsize=(15, 9))
ax.set_ylabel('Number of fatalities')
plt.legend(title='Race', bbox_to_anchor=(1.0, 1), loc='upper left')
plt.show()

Map chart for events in every state:

In [None]:
df.State.value_counts()

In [None]:
display(Image(filename='../input/events-by-state/events_state.png')) 

In [None]:
df.sample(10)

In [None]:
df_ht=df.copy(deep=True)  # This will be used later in hypothesis testing

With the analysis done, the next step is **feature engineering**: preparing the data to be used in ML models.

In [None]:
df.info()

Features such as UID and Name, which contain unique values per record, should be left as they are because they do not add valuable information to the ML model. On the other hand, we have 1 numerical, 1 timestamp and 8 categorical features to prepare.

* Numerical feature:  
Age: Apply min-max scaling in order to have all features with values between 0-1.  
Year, month, day: Keep the same.

* Timestamp feature:  
Date: Has already been split into year, month, day and day of the week; its transformation for use in an ML model is still pending.

* Categorical features:  
Gender: Use LabelBinarizer to encode the two categories as 0 or 1.  
Race: Use get_dummies to apply one hot encoding which will create 6 features.  
City: As the dataset contains more than three thousand cities and the most frequent one represents only 2% of the total, we will omit this feature and only encode State.  
State: Use get_dummies to apply one hot encoding which will create 51 features.  
Manner of death: Use get_dummies to apply one hot encoding which will create 4 features.  
Armed: Use get_dummies to apply one hot encoding which will create 58 features.  
Mental illness: Use astype(int) to encode the two categories as 0 or 1.  
Flee: Use astype(int) to encode the two categories as 0 or 1.  
Day of the week: Keep the same.

In [None]:
df.sample(10)      # Age before scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# MinMaxScaler expects 2-D input; df[['Age']] is already a one-column DataFrame, so no helper column is needed
df[['Age']] = scaler.fit_transform(df[['Age']])

In [None]:
df.sample(10)      # Age after scaling

In [None]:
df.describe()
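
Min-max scaling maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. A minimal sketch of that formula in plain pandas, on made-up ages:

```python
import pandas as pd

ages = pd.Series([16.0, 25.0, 40.0, 76.0])

# Shift by the minimum and divide by the range
scaled = (ages - ages.min()) / (ages.max() - ages.min())
print(scaled)
```

The order and relative spacing of the values are preserved; only the scale changes.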

Below our dataset should have 16 columns:

In [None]:
df.shape

In [None]:
sns.histplot(df['Age'], bins=30, kde=True, stat='density')  # Same distribution shape; the feature was only scaled to 0-1.

Two features are boolean (Mental illness and Flee). Instead of LabelBinarizer, we will simply use astype(int) for these, which gives exactly the 0/1 encoding we want. 

In [None]:
df[['Mental_illness','Flee']] = df[['Mental_illness','Flee']].astype(int)

In [None]:
from sklearn.preprocessing import LabelBinarizer
binarizer=LabelBinarizer()
df['Gender'] = binarizer.fit_transform(df['Gender'])

In [None]:
df.describe()   # Gender, Mental_illness and Flee now appear in the describe() output because they are numerical.

In [None]:
df.head()

In [None]:
df9=df.copy(deep=True)   #A copy of the current dataset to have a backup
df9.head()

Columns to be one hot encoded:

In [None]:
col_encoding=['Race','State','Manner_of_death','Armed']   
col_encoding

In [None]:
df9=pd.get_dummies(df9,columns=col_encoding,drop_first=True)

Below our dataset should have 127 columns in total:

In [None]:
df9.shape
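
The column count follows from how `get_dummies` works with `drop_first=True`: a feature with k categories becomes k-1 indicator columns. A minimal sketch on a toy frame (the data is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Race": ["White", "Black", "Hispanic", "White"],
    "Gender": [1, 0, 1, 1],
})

# 3 categories in 'Race' -> 2 indicator columns; the first category ('Black') is dropped
encoded = pd.get_dummies(toy, columns=["Race"], drop_first=True)
print(encoded.columns.tolist())
```

A row with zeros in all remaining indicator columns belongs to the dropped category.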

In [None]:
df9.describe().T

**Hypothesis testing**

1. The mean age is the same for men and women. t-test and Levene test.  
2. The mean age is the same for every race. ANOVA.  
3. Manner of death by race. Chi-square test of independence.  

In [None]:
df_ht.sample(10)

1. The mean age is the same for men and women. t-test and Levene test:

In [None]:
data = {'Mean':[df_ht[df_ht['Gender'] == 'Male']['Age'].mean(), df_ht[df_ht['Gender'] == 'Female']['Age'].mean()],
        'Standard_deviation':[df_ht[df_ht['Gender'] == 'Male']['Age'].std(), df_ht[df_ht['Gender'] == 'Female']['Age'].std()]}
 
pd.DataFrame(data, index=['Male','Female'])

In [None]:
import scipy.stats

We define our hypotheses for the Levene test:  
H0: The age variance is the same for females and males.  
H1: The age variance is not the same for females and males.

In [None]:
scipy.stats.levene(df_ht[df_ht['Gender'] == 'Female']['Age'],
                   df_ht[df_ht['Gender'] == 'Male']['Age'], center='mean')

As the p-value is less than 0.05, we reject H0 and assume different variances.  
We define our hypotheses for the t-test:  
H0: The mean age is the same for females and males.  
H1: The mean age is not the same for females and males.

In [None]:
scipy.stats.ttest_ind(df_ht[df_ht['Gender'] == 'Female']['Age'],
                   df_ht[df_ht['Gender'] == 'Male']['Age'], equal_var = False)

As the p-value is less than 0.05, we reject the null hypothesis; there is therefore a statistically significant difference in mean age between genders.

2. The mean age is the same for every race. ANOVA:

In [None]:
data2 = {'Mean':[df_ht[df_ht['Race'] == 'White']['Age'].mean(), df_ht[df_ht['Race'] == 'Black']['Age'].mean(), df_ht[df_ht['Race'] == 'Hispanic']['Age'].mean(), df_ht[df_ht['Race'] == 'Asian']['Age'].mean(), df_ht[df_ht['Race'] == 'Native']['Age'].mean(), df_ht[df_ht['Race'] == 'Other']['Age'].mean()],
        'Standard_deviation':[df_ht[df_ht['Race'] == 'White']['Age'].std(), df_ht[df_ht['Race'] == 'Black']['Age'].std(), df_ht[df_ht['Race'] == 'Hispanic']['Age'].std(), df_ht[df_ht['Race'] == 'Asian']['Age'].std(), df_ht[df_ht['Race'] == 'Native']['Age'].std(), df_ht[df_ht['Race'] == 'Other']['Age'].std()]}
 
pd.DataFrame(data2, index=['White','Black','Hispanic','Asian','Native','Other'])

We define our hypotheses for the Levene test:  
H0: The age variance is the same for all races.  
H1: The age variance is not the same for all races.

In [None]:
scipy.stats.levene(df_ht[df_ht['Race'] == 'White']['Age'],
                   df_ht[df_ht['Race'] == 'Black']['Age'], 
                   df_ht[df_ht['Race'] == 'Hispanic']['Age'],
                   df_ht[df_ht['Race'] == 'Asian']['Age'],
                   df_ht[df_ht['Race'] == 'Native']['Age'], 
                   df_ht[df_ht['Race'] == 'Other']['Age'], 
                   center='mean')

As the p-value is less than 0.05, we reject H0 and assume different variances. One-way ANOVA assumes equal variances, so the result below should be read with that in mind.  
We define our hypotheses for the ANOVA:  
H0: The mean age is the same for all races.  
H1: At least one race has a different mean age.

In [None]:
White = df_ht[df_ht['Race'] == 'White']['Age']
Black = df_ht[df_ht['Race'] == 'Black']['Age']
Hispanic = df_ht[df_ht['Race'] == 'Hispanic']['Age']
Asian = df_ht[df_ht['Race'] == 'Asian']['Age']
Native = df_ht[df_ht['Race'] == 'Native']['Age']
Other = df_ht[df_ht['Race'] == 'Other']['Age']

In [None]:
f_statistic, p_value = scipy.stats.f_oneway(White, Black, Hispanic, Asian, Native, Other)
print("F_Statistic: {0}, P-Value: {1}".format(f_statistic,p_value))
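
Since the Levene test rejected equal variances, one possible robustness check is the rank-based Kruskal-Wallis test, which does not assume equal variances or normality. This is not part of the original analysis; it is a sketch on simulated groups (the means, spreads and sample sizes below are made up) using `scipy.stats.kruskal`:

```python
import numpy as np
import scipy.stats

# Simulated age samples with different means and different spreads
rng = np.random.default_rng(0)
group_a = rng.normal(loc=30, scale=5, size=50)
group_b = rng.normal(loc=36, scale=12, size=50)
group_c = rng.normal(loc=45, scale=8, size=50)

# Classic one-way ANOVA (assumes equal variances)
f_stat, p_anova = scipy.stats.f_oneway(group_a, group_b, group_c)

# Kruskal-Wallis H-test (rank-based, no equal-variance assumption)
h_stat, p_kruskal = scipy.stats.kruskal(group_a, group_b, group_c)

print("ANOVA p-value: {0:.3g}, Kruskal-Wallis p-value: {1:.3g}".format(p_anova, p_kruskal))
```

In the notebook the same call would take the six series defined above: `scipy.stats.kruskal(White, Black, Hispanic, Asian, Native, Other)`. If both tests agree, the ANOVA conclusion is less likely to be an artifact of unequal variances.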

As the p-value is less than 0.05, we reject H0; therefore at least one race has a mean age that differs significantly from the others.

3. Manner of death by race. Chi-square for test of independence:

In [None]:
data3=pd.crosstab(df_ht['Manner_of_death'], df_ht['Race'], rownames=['Manner_of_death'], colnames=['Race'])
data3

We will define a new dataset derived from data3 in order to only consider Shot or Other as Manner of death:

In [None]:
Manner=['Shot and Tasered', 'Tasered']
df_ht2=df_ht.copy(deep=True)

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df_ht2['Manner_of_death'] = df_ht2['Manner_of_death'].replace(['Shot and Tasered', 'Tasered'], 'Other')
data4=pd.crosstab(df_ht2['Manner_of_death'], df_ht2['Race'], rownames=['Manner_of_death'], colnames=['Race'])
data4

We define our hypotheses for the Chi-square test:  
H0: Being shot to death by police is independent of race.  
H1: Being shot to death by police is associated with race.

In [None]:
scipy.stats.chi2_contingency(data4, correction = True)
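
The result of `chi2_contingency` can be unpacked into four parts: the statistic, the p-value, the degrees of freedom and the table of expected counts under independence. A minimal sketch on a made-up 2x2 table:

```python
import numpy as np
import scipy.stats

# Toy contingency table: rows = manner of death, columns = two groups
table = np.array([[90, 60],
                  [10, 40]])

chi2, p_value, dof, expected = scipy.stats.chi2_contingency(table, correction=True)

print("chi2 = {0:.2f}, p-value = {1:.3g}, dof = {2}".format(chi2, p_value, dof))
print("Expected counts under independence:")
print(expected)
```

Comparing `expected` with the observed table shows which cells drive the association. Yates' continuity correction (`correction=True`) is only applied when there is one degree of freedom, as in this 2x2 example.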

As the p-value is less than 0.05, we reject H0; we can therefore assume that being shot to death by police is associated with race.