Dear Kagglers,

For my first Kernel, I chose the Stanford Open Policing Project - Illinois dataset. It contains road police control data gathered from 2011 to 2015, including driver age, race, vehicle brand and potential violations.

In this small EDA, I will try to understand whether there are correlations between the driver's profile, car brands and potential violations. The end goal is to build a predictive model (in my 2nd Kernel :)) to help police target the right profiles when they stop vehicles.

Folks, your feedback is more than welcome. Since it's my first Kernel, I look forward to your remarks so that I can improve my contributions to the community.

## I. Initial Data Loading and Processing
                a. Data Loading + Processing 
                b. Features Adaptation
##  II. General EDA : Understanding the dataset 
                a. Gender, Violation, Driver race 
                b. % of Arrests = f(hour_day)
                c. Age = f(hour_day)
## III. Racial Based Study
                a. Distribution of driver age with driver race
                b. Repartition (%) of Stop Outcome for each driver race 
                c. Ratio (%) of Search Conducted for each driver race 
                d. Ratio of Contraband found when search is conducted (%)
## IV. Vehicle Brand Based Study
                a. Top ten Vehicle Brands for Arrests  Number
                b. Top ten Vehicle Brands for Speeding Violations  
                c. Top ten Vehicle Models for Ratio of Contraband found when 




In [14]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import itertools

sns.set(style="whitegrid")


# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

#    Load the data set : set a row limit when writing drafts to reduce loading time
lim = -1  # number of rows to read; any value <= 0 loads the whole file
data = pd.read_csv("../input/IL.csv", low_memory=False, nrows=lim if lim > 0 else None)

**a. Data Loading + Processing :**

Objective : Arrange the data so that we can extract the information needed to understand the dataset

               - Extract the column names : to get all the features available
               - Convert the stop_time, search_conducted and contraband_found columns to string (astype(str)) to facilitate data processing
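
Converting flags to strings changes how they have to be compared afterwards, which is why later cells test against `'True'` rather than `True`. The sketch below illustrates this on a made-up frame; the `toy` frame and its values are hypothetical and are not taken from IL.csv. It also counts missing values with `isna()` before the conversion, as an assumption-free way to avoid depending on how missing values are rendered as text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "stop_time": ["08:15", np.nan, "23:50"],
    "search_conducted": [True, False, True],
})

# Column names give the list of available features
feature_names = toy.columns.tolist()

# Count missing stop times before any string conversion
n_missing = int(toy["stop_time"].isna().sum())

# After astype(str), boolean flags become the strings 'True' / 'False'
search_as_text = toy["search_conducted"].astype(str)
n_searches = int((search_as_text == "True").sum())

print(feature_names)
print(n_missing, n_searches)
```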


In [15]:
#    Information about the dataset :
Cols = data.columns
print(Cols)
N_controls = len(data['id']) # Number of Controls
# ------------------------------------------------------------------------------------------------------------

# Information loading + Preprocessing :
# Sorted unique values of each feature (missing values are dropped so that sorting never mixes str and NaN)
Ages = np.sort(data['driver_age'].dropna().unique()).tolist()
Races = np.sort(data['driver_race'].dropna().unique()).tolist()
Violations = np.sort(data['violation'].dropna().unique()).tolist()
Vehicles = np.sort(data['vehicle_type'].dropna().unique()).tolist()
data['stop_time'] = data['stop_time'].astype(str) # Put the hours in string format
data['search_conducted'] = data['search_conducted'].astype(str) # 'True' / 'False' strings
data['contraband_found'] = data['contraband_found'].astype(str) # 'True' / 'False' strings


**b. Features adaptation :**

Objective : Arrange the data in order to be able to extract the information we want to understand the dataset

               - ['stop_time'] : create a new column to get normalized stop time per hour, per half hour 
               - ['vehicle_type'] : extract the brand out of the model 
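
As an illustration of the per hour and per half hour normalization described above, here is a minimal sketch that relies on pandas datetime rounding instead of string slicing. It assumes the stop times are `'HH:MM'` strings; the sample values are hypothetical and not taken from IL.csv.

```python
import pandas as pd

# Hypothetical sample of 'HH:MM' stop times (not taken from IL.csv)
stop_time = pd.Series(["08:05", "08:47", "23:59", "00:30"])

# Parse the strings as datetimes (pandas attaches a default date, unused here)
parsed = pd.to_datetime(stop_time, format="%H:%M")

# Per hour: keep only the hour component as a 2-character string
hour_slot = parsed.dt.strftime("%H")

# Per half hour: round down to the start of the 30-minute slot
half_hour_slot = parsed.dt.floor("30min").dt.strftime("%H:%M")

print(hour_slot.tolist())
print(half_hour_slot.tolist())
```

Rounding down with `floor` keeps every stop inside the slot in which it actually happened, so the last slot of the day stays at 23:30 instead of spilling over to the next day.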

In [16]:
  # Convert an 'HH:MM' time to the start of its 15-minute range :
def Time_to_Range(time):
    hour = int(time[:2]) % 24          # treat hour 24 as 00
    mins = (int(time[3:5]) // 15) * 15 # 00, 15, 30 or 45
    t = "{:02d}:{:02d}".format(hour, mins)
    return "'" + t + "'"

Method = 'A' # Choose the method to handle missing stop times (stored as the string 'nan')
if Method =='B':
    # Method B : replace missing stop times with noon
    data['stop_time'] = data['stop_time'].replace('nan','12:00')
else:
    # Method A : drop the rows with a missing stop time
    data.drop(data[data.stop_time=='nan'].index, inplace=True)
data['hour'] = data['stop_time'].str[:2] # hour of the stop, as a 2-character string
data['vehicle_brand'] = data['vehicle_type'].str[:3] # first 3 characters used as the brand code


**a. Gender, Violation, Driver race**

Objective : Present the repartition of all main features (%) to have an overview of the dataset

               - ['driver_gender'] : ratio male/female
               - ['driver_race'] : split between races 
               - ['violation'] : split between violations
               - ['stop_outcome'] : split between outcomes (Citation, Written Warning)
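
The percentage split of a categorical column can also be obtained directly with `value_counts(normalize=True)`. A minimal sketch follows; the `toy` frame and the helper name `share_percent` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "driver_gender": ["M", "M", "F", "M"],
    "violation": ["Speeding", "Equipment", "Speeding", "Speeding"],
})

def share_percent(frame, column):
    """Return the percentage of rows in each category, largest first."""
    return frame[column].value_counts(normalize=True) * 100

gender_share = share_percent(toy, "driver_gender")
violation_share = share_percent(toy, "violation")

print(gender_share)
print(violation_share)
```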

In [17]:
fig1 = plt.figure(figsize=(10,15))
cat = ['driver_gender','driver_race','violation','stop_outcome']
length=len(cat)
n_rows = int(np.ceil(length/2)) # plt.subplot expects integer grid sizes
for j,i in enumerate(cat):
    plt.subplot(n_rows,2,j+1)
    plt.subplots_adjust(hspace=.5)
    df_count = data[i].value_counts()
    l = len(data[i])
    df_perc = df_count/l*100 # share of all controls, in %
    sns.barplot(x=df_perc.index, y=df_perc.values, alpha=0.7)
    plt.xticks(rotation=90)
    plt.title("Repartition of " + i.replace('_',' ')+ " (%)")

**b. % of Arrests = f(hour_day)**

In [18]:
# Group the results by the hour of day :
    # Arrests by hour (each row of the dataset is one recorded control)
Arrests_per_hour = data.groupby(['hour'])['id'].count()
Arrests = data['id'].count()
Arrests_per_hour_perc = Arrests_per_hour/Arrests*100
fig_ar = plt.figure(figsize=(12,6))
sns.barplot(x=Arrests_per_hour_perc.index, y=Arrests_per_hour_perc.values, alpha=1.0)
plt.xlabel('Hour of the day', fontsize=14)
plt.ylabel('Percentage of Arrests (%)', fontsize=14)
plt.title('Ratio of Arrests in the day (%)')
plt.show()

**c. Age = f(hour_day)**

In [19]:
# Group the results by the hour of day :
    # Mean age by hour
Age_by_hour = data.groupby(['hour'])['driver_age'].mean()
fig_ag = plt.figure(figsize=(12,6))
sns.barplot(x=Age_by_hour.index, y=Age_by_hour.values)
plt.xlabel('Hour of the day', fontsize=14)
plt.ylabel('Mean of the Age', fontsize=14)
plt.title('Average Age vs Hour of the Day')
plt.show()


**II. Conclusions**
    
- White drivers represent the majority of controls, followed by Black and Hispanic drivers
        
**Question :** Is there a correlation between the number of searches conducted by police and the driver race ?
           
- Speeding represents the majority of violations :
 
**Question :** Are there correlations between the violation type and driver race ? Vehicle brand ?

- Citation represents the majority of stop outcomes :
 
**Question :** Are there correlations between the stop outcome and driver race ? Vehicle brand ?
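
One simple way to start answering these questions is a row-normalized cross-tabulation, which gives the percentage split of one feature inside each category of another. The sketch below uses a hypothetical `toy` frame with placeholder group labels `"A"` and `"B"`; it is descriptive only and does not test whether a difference is statistically significant.

```python
import pandas as pd

# Hypothetical toy frame with placeholder group labels; values are invented.
toy = pd.DataFrame({
    "driver_race": ["A", "A", "A", "A", "B", "B"],
    "violation": ["Speeding", "Speeding", "Speeding", "Equipment",
                  "Speeding", "Equipment"],
})

# normalize="index" makes every row sum to 1, i.e. the split of violations
# within each group; multiplying by 100 turns it into percentages.
violation_by_race = pd.crosstab(
    toy["driver_race"], toy["violation"], normalize="index"
) * 100

print(violation_by_race)
```

If the rows of such a table look alike, the two features show little association in the sample; rows that differ strongly are the ones worth plotting in more detail.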

**III. a. Distribution of driver age with driver race**

In [20]:
# Violin plot of a numeric variable for each category
var_name_x = 'driver_race'
var_name_y = 'driver_age'
def Violin_Plot_Num(var_name_x,var_name_y):
    data[var_name_y] = data[var_name_y].astype(np.float64)
    col_order = np.sort(data[var_name_x].dropna().unique()).tolist()
    fig3 = plt.figure(figsize=(12,6))
    sns.violinplot(x=var_name_x , y=var_name_y, data=data,palette="Set3", order=col_order)
    plt.xlabel(var_name_x, fontsize=12)
    plt.ylabel(var_name_y, fontsize=12)
    plt.title("Distribution of " + var_name_y.replace('_',' ') +" variable with "+var_name_x.replace('_',' ') , fontsize=15)
    plt.show()
Violin_Plot_Num('driver_race','driver_age')

**b. Repartition (%) of Stop Outcome for each driver race** 

In [21]:
cat_race = Races
length=len(cat_race)
n_rows = int(np.ceil(length/2)) # plt.subplot expects integer grid sizes
fig6=plt.figure(figsize=(10,15))
for j,i in enumerate(cat_race):
    plt.subplot(n_rows,2,j+1)
    plt.subplots_adjust(hspace=1)
    df_outcome = data[data['driver_race']==i]['stop_outcome'].value_counts()
    out_total = len(data[data['driver_race']==i]['stop_outcome'])
    df_outcome_perc = df_outcome/out_total*100
    sns.barplot(x=df_outcome_perc.index, y=df_outcome_perc.values,palette="Set3", alpha=0.7)
    plt.xticks(rotation=90)
    plt.title("Repartition of Stop Outcome % for " + i)

**c. Ratio (%) of Search Conducted for each driver race**

In [22]:
cat_race = Races
length=len(cat_race)
n_rows = int(np.ceil(length/2)) # plt.subplot expects integer grid sizes
fig6=plt.figure(figsize=(10,15))
for j,i in enumerate(cat_race):
    plt.subplot(n_rows,2,j+1)
    plt.subplots_adjust(hspace=1)
    df_search = data[data['driver_race']==i]['search_conducted'].value_counts()
    search_total = len(data[data['driver_race']==i]['search_conducted'])
    df_search_perc = df_search/search_total*100
    sns.barplot(x=df_search_perc.index, y=df_search_perc.values,palette="Blues_d", alpha=0.7)
    plt.xticks(rotation=90)
    plt.title("Ratio of Search Conducted  % for " + i)

**d. Ratio of Contraband found when search is conducted (%)**

In [23]:
fig10 = plt.figure(figsize=(12,6))
# Number of searches conducted for each driver race
df_sch = data[data['search_conducted']=='True'].groupby(['driver_race'])['contraband_found'].count()
# Number of controls where contraband was found for each driver race
df_contra_found = data[data['contraband_found']=='True'].groupby(['driver_race'])['contraband_found'].count()
df_contra_found_perc = df_contra_found/df_sch*100
sns.barplot(x=df_contra_found_perc.index, y=df_contra_found_perc.values,palette="Set2", alpha=0.7)
plt.xticks(rotation=90)
plt.title("Ratio of Contraband found when search is conducted (%) ")

**III. Conclusions**

- Driver Age : driver race has little impact on the age distribution; for every race most of the distribution is roughly centered around 25-30 years old

 - Stop Outcome : driver race seems to have no impact on the stop outcome split
    
 - Search Conducted : the ratio of searches conducted seems slightly higher for Black and Hispanic drivers
        
 **Question :** Is this justified by the ratio of contraband found when searches are conducted ?
    
  - Ratio Contraband found / search conducted : White drivers have the highest ratio of contraband found when searches are conducted, so the higher search rate for Black and Hispanic drivers is not justified by this ratio

**Conclusion :** Based on these ratios, increasing the success rate in contraband control would require a higher search rate for White drivers.
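
The comparison behind this conclusion rests on two ratios per driver race: the share of controls that lead to a search, and the share of searches that find contraband. A minimal sketch of both is given below, assuming the two flags are kept as real booleans (the notebook stores them as strings). The `toy` frame, the placeholder labels `"A"` and `"B"` and the helper name `search_and_hit_rates` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with placeholder group labels; values are invented.
toy = pd.DataFrame({
    "driver_race": ["A"] * 5 + ["B"] * 5,
    "search_conducted": [True, True, False, False, False,
                         True, True, True, True, False],
    "contraband_found": [True, False, False, False, False,
                         True, False, False, False, False],
})

def search_and_hit_rates(frame):
    """Return, per group, the number of searches, the search rate (%)
    and the share of searches that found contraband (%)."""
    grouped = frame.groupby("driver_race")
    n_searches = grouped["search_conducted"].sum()
    search_rate = grouped["search_conducted"].mean() * 100
    # Restrict to searched controls before computing the contraband ratio
    searched = frame[frame["search_conducted"]]
    hit_rate = searched.groupby("driver_race")["contraband_found"].mean() * 100
    return pd.DataFrame({
        "n_searches": n_searches,
        "search_rate_pct": search_rate,
        "hit_rate_pct": hit_rate,
    })

rates = search_and_hit_rates(toy)
print(rates)
```

Keeping the number of searches next to the two ratios makes it easier to spot groups whose ratio rests on very few searches.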

**IV. Vehicle Brand Based Study**

Objective : Vehicle brand is the first characteristic observed by police officers before choosing to stop a car.
Finding a pattern between vehicle brands and violations can help to target controls efficiently.

 **a. Top ten Vehicle Brands for Arrests  Number**

In [24]:
# Share of all controls for each vehicle brand (top 10)
df_total_cars = data['vehicle_brand'].count()
df_10 = data.groupby(['vehicle_brand'])['id'].count().sort_values(axis=0,ascending=False).head(10)
df_10_perc = df_10/df_total_cars*100
# Share of all speeding violations for each vehicle brand (top 10)
df_total_speed = data[data['violation']=='Speeding']['violation'].count()
df_spd = data[data['violation']=='Speeding'].groupby(['vehicle_brand'])['violation'].count().sort_values(axis=0,ascending=False).head(10)
df_spd_perc = df_spd/df_total_speed*100

# Ratio of contraband found for each vehicle model, keeping only models searched more than 50 times
df_contra_found = data[data['contraband_found']=='True'].groupby(['vehicle_type'])['contraband_found'].count()
df_search = data[data['search_conducted']=='True'].groupby(['vehicle_type'])['contraband_found'].count()
df_search = df_search[df_search>50]
df_contra_found_perc = df_contra_found/df_search*100
# Models without enough searches end up as NaN after the division and are dropped
df_contra_found_perc = df_contra_found_perc.dropna().sort_values(axis=0,ascending=False).head(10)

def H_barplot(dataframe,namex,namey,title,color):
    # Horizontal bar plot of a Series : values on the x axis, index labels on the y axis
    plt.figure(figsize=(12,6))
    sns.barplot(x=dataframe.values,y=dataframe.index,palette=color,orient="h")
    plt.xlabel(namex , fontsize=14)
    plt.ylabel(namey, fontsize=14)
    plt.title(title)
    plt.show()

H_barplot(df_10_perc,'% of Arrests','Vehicle Brands','Top ten vehicle brands for number of Arrests',"BuGn_d")


**b. Top ten Vehicle Brands for Speeding Violations**

In [25]:
H_barplot(df_spd_perc,'% of Speeding Violation','Vehicle Brands','Top ten vehicle brands for number of Speeding Violation',"Set2")

**c. Top ten Vehicle Models for Ratio of Contraband found when** 

In [26]:
H_barplot(df_contra_found_perc,'% of cases Contraband found', 'Vehicle Models', 'Top ten vehicle models for % of contraband found when searches are conducted',"Set1")

**IV. Vehicle Brand Study Conclusions**

At this stage we cannot see a real pattern, since the split of Speeding Violations follows the split of vehicle brands.

But the last plot can provide interesting information for police officers to target vehicle models in their
drug-related controls.
