___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

<h1><p style="text-align: center;">Data Analysis with Python <br>Project - 1 - Traffic Police Stops</p></h1> <img src="https://docs.google.com/uc?id=17CPCwi3_VvzcS87TOsh4_U8eExOhL6Ki" class="img-fluid" alt="CLRSWY" width="200" height="100"> 

Does the ``gender`` of a driver have an impact on police behavior during a traffic stop? **In this chapter**, you will explore that question while practicing filtering, grouping, method chaining, Boolean math, string methods, and more!

***

## Examining traffic violations

Before comparing the violations being committed by each gender, you should examine the ``violations`` committed by all drivers to get a baseline understanding of the data.

In this exercise, you'll count the unique values in the ``violation`` column, and then separately express those counts as proportions.

> Before starting this section, **repeat the data-preparation steps from the previous chapter.** Continue from where you left off at the end of that chapter.

In [1]:
# Importing Pandas Library
import pandas as pd

# Avoiding unnecessary warnings
import warnings
warnings.filterwarnings('ignore')
warnings.warn("this will not show")

# Reading police.csv file, creating DataFrame named ri
ri = pd.read_csv('police.csv.zip', nrows=50000)

# Dropping county_name, county_fips, fine_grained_location, search_type_raw, and state columns from DataFrame (search_type is kept because it is used later)
ri.drop(['county_name','county_fips','fine_grained_location','search_type_raw','state'], axis=1, inplace=True)

# Dropping rows that contains missing values of driver_gender column
ri.dropna(subset=['driver_gender'], inplace=True)

# Changing is_arrested column data type from object to bool
ri['is_arrested'] = ri['is_arrested'].astype('bool')

# Concatenate stop_date and stop_time
ri['combined'] = ri['stop_date'].str.cat(ri['stop_time'], sep=' ')

# Converting combined column from object to datetime, and storing the result in a new stop_datetime column
ri['stop_datetime'] = pd.to_datetime(ri['combined'])

# Setting stop_datetime column as index of the DataFrame
ri.set_index('stop_datetime', inplace=True)

# Dropping stop_date, stop_time, combined columns
ri.drop(['stop_date', 'stop_time', 'combined'], axis=1, inplace=True)

# Examining first five rows of the DataFrame
ri.head()

stop_datetime,id,location_raw,police_department,driver_gender,driver_age_raw,driver_age,driver_race_raw,driver_race,violation_raw,violation,search_conducted,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district
2005-01-02 01:55:00,RI-2005-00001,Zone K1,600,M,1985.0,20.0,W,White,Speeding,Speeding,False,,False,Citation,False,0-15 Min,False,False,Zone K1
2005-01-02 20:30:00,RI-2005-00002,Zone X4,500,M,1987.0,18.0,W,White,Speeding,Speeding,False,,False,Citation,False,16-30 Min,False,False,Zone X4
2005-01-04 12:55:00,RI-2005-00004,Zone X4,500,M,1986.0,19.0,W,White,Equipment/Inspection Violation,Equipment,False,,False,Citation,False,0-15 Min,False,False,Zone X4
2005-01-06 01:30:00,RI-2005-00005,Zone X4,500,M,1978.0,27.0,B,Black,Equipment/Inspection Violation,Equipment,False,,False,Citation,False,0-15 Min,False,False,Zone X4
2005-01-12 08:05:00,RI-2005-00006,Zone X1,0,M,1973.0,32.0,B,Black,Call for Service,Other,False,,False,Citation,False,30+ Min,True,False,Zone X1


**INSTRUCTIONS**

*   Count the unique values in the ``violation`` column, to see what violations are being committed by all drivers.
*   Express the violation counts as proportions of the total.

In [2]:
# Proportion of each violation type (normalize=True returns fractions of the total instead of raw counts)
ri['violation'].value_counts(normalize=True)

Speeding               0.752156
Moving violation       0.135847
Equipment              0.062945
Registration/plates    0.030473
Other                  0.018579
Name: violation, dtype: float64

In [3]:
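# Formatting floats to display 2 decimal digits followed by a percent sign
# (this is a global display option: it affects every float shown from here on)
pd.options.display.float_format = '{:,.2f} %'.format

# Percentage of each violation type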
ri['violation'].value_counts()/len(ri['violation'])*100

Speeding              75.22 %
Moving violation      13.58 %
Equipment              6.29 %
Registration/plates    3.05 %
Other                  1.86 %
Name: violation, dtype: float64

In [4]:
# Speeding is the most common violation at 75.22 percent, followed by moving violations at 13.58 percent.
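
Because `pd.options.display.float_format` is a global setting, every float displayed afterwards gets the ` %` suffix, including columns that are not percentages. The following is a minimal sketch of one alternative, formatting only the result at hand. It uses made-up toy data rather than `police.csv`, and the names `toy`, `percentages`, and `as_percent` are hypothetical.

```python
import pandas as pd

# Toy data (made up for illustration; not taken from police.csv)
toy = pd.Series(['Speeding', 'Speeding', 'Speeding', 'Equipment'], name='violation')

# Proportion of each value, scaled to percentages
percentages = toy.value_counts(normalize=True) * 100

# Format only this result as text, leaving pandas' global display options untouched
as_percent = percentages.map('{:.2f} %'.format)
print(as_percent)
```

The trade-off is that `as_percent` holds strings, so it is suitable for display only; keep `percentages` for any further arithmetic.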

***

## Comparing violations by gender

The question we're trying to answer is whether male and female drivers tend to commit different types of traffic violations.

You'll first create a ``DataFrame`` for each gender, and then analyze the ``violations`` in each ``DataFrame`` separately.

**INSTRUCTIONS**

*   Create a ``DataFrame``, female, that only contains rows in which ``driver_gender`` is ``'F'``.
*   Create a ``DataFrame``, male, that only contains rows in which ``driver_gender`` is ``'M'``.
*   Count the ``violations`` committed by female drivers and express them as proportions.
*   Count the violations committed by male drivers and express them as proportions.

In [5]:
# Calculating number of total violations by female drivers
female = ri[ri['driver_gender'] == 'F']
print('Number of violations by female drivers:', female.shape[0])

Number of violations by female drivers: 13309


In [6]:
# Calculating number of total violations by male drivers
male = ri[ri['driver_gender'] == 'M']
print('Number of violations by male drivers:', male.shape[0])

Number of violations by male drivers: 34701


In [7]:
# Formatting float number to display 2 decimal digits
pd.options.display.float_format = '{:,.2f} %'.format

print('The violation distribution of FEMALE Drivers:')

# Percentage of each violation type committed by female drivers
female['violation'].value_counts()/len(female['violation'])*100

The violation distribution of FEMALE Drivers:


Speeding              81.12 %
Moving violation       9.90 %
Equipment              4.56 %
Registration/plates    2.76 %
Other                  1.66 %
Name: violation, dtype: float64

In [8]:
# Formatting float number to display 2 decimal digits
pd.options.display.float_format = '{:,.2f} %'.format

print('The violation distribution of MALE Drivers:')

# Percentage of each violation type committed by male drivers
male['violation'].value_counts()/len(male['violation'])*100

The violation distribution of MALE Drivers:


Speeding              72.95 %
Moving violation      15.00 %
Equipment              6.96 %
Registration/plates    3.16 %
Other                  1.93 %
Name: violation, dtype: float64

In [9]:
# Share of all stops by gender (each group's count divided by the combined total)

print('Violation percentage by female drivers:','{:.2f} %'.format(female.shape[0]/(female.shape[0]+male.shape[0]) * 100))
print('Violation percentage by male drivers:', '{:.2f} %'.format(male.shape[0]/(female.shape[0]+male.shape[0]) * 100))

Violation percentage by female drivers: 27.72 %
Violation percentage by male drivers: 72.28 %


In [10]:
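# Male drivers account for about 2.6 times as many stops as female drivers (34,701 vs. 13,309)
# Perhaps unexpectedly, speeding makes up a larger share of violations for female drivers (81.12 %) than for male drivers (72.95 %)
# Equipment violations make up a smaller share for female drivers (4.56 %) than for male drivers (6.96 %)
# Moving violations make up a larger share for male drivers (15.00 %) than for female drivers (9.90 %)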
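
As a complement to building one `DataFrame` per gender, the same comparison can be sketched in a single step with `pd.crosstab`. This is only an illustration on made-up toy data (not `police.csv`), and the names `toy`, `gender_share`, and `violation_mix` are hypothetical.

```python
import pandas as pd

# Toy data (made up for illustration; not taken from police.csv)
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'violation': ['Speeding', 'Speeding', 'Speeding', 'Equipment',
                  'Speeding', 'Speeding', 'Equipment', 'Equipment'],
})

# Share of all stops that belongs to each gender
gender_share = toy['driver_gender'].value_counts(normalize=True)

# Violation mix within each gender: normalize='index' makes every row sum to 1
violation_mix = pd.crosstab(toy['driver_gender'], toy['violation'], normalize='index')

print(gender_share)
print(violation_mix)
```

Each row of `violation_mix` plays the role of one per-gender `value_counts(normalize=True)` result, so the genders can be read side by side.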

***

## Comparing speeding outcomes by gender

When a driver is pulled over for speeding, many people believe that gender has an impact on whether the driver will receive a ticket or a warning. Can you find evidence of this in the dataset?

First, you'll create two ``DataFrames`` of drivers who were stopped for ``speeding``: one containing ***females*** and the other containing ***males***.

Then, for each **gender**, you'll use the ``stop_outcome`` column to calculate what percentage of stops resulted in a ``"Citation"`` (meaning a ticket) versus a ``"Warning"``.

**INSTRUCTIONS**

*   Create a ``DataFrame``, ``female_and_speeding``, that only includes female drivers who were stopped for speeding.
*   Create a ``DataFrame``, ``male_and_speeding``, that only includes male drivers who were stopped for speeding.
*   Count the **stop outcomes** for the female drivers and express them as proportions.
*   Count the **stop outcomes** for the male drivers and express them as proportions.

In [11]:
# Creating female_and_speeding, that includes only female drivers, with speeding violation
female_and_speeding = female[female['violation'] == 'Speeding']
female_and_speeding.shape

(10796, 19)

In [12]:
# Creating male_and_speeding, that includes only male drivers, with speeding violation
male_and_speeding = male[male['violation'] == 'Speeding']
male_and_speeding.shape

(25315, 19)

In [13]:
#Calculating the percentage of each stop_outcome type in female_and_speeding dataframe
print('The stop_outcome of FEMALE speeding violation;')
round(female_and_speeding['stop_outcome'].value_counts()/len(female_and_speeding['violation'])*100, 2)

The stop_outcome of FEMALE speeding violation;


Citation           97.34 %
Arrest Driver       0.74 %
N/D                 0.36 %
Arrest Passenger    0.23 %
No Action           0.03 %
Name: stop_outcome, dtype: float64

In [14]:
#Calculating the percentage of each stop_outcome type in male_and_speeding dataframe
print('The stop_outcome of MALE speeding violation;')
round(male_and_speeding['stop_outcome'].value_counts()/len(male_and_speeding['violation'])*100, 2)

The stop_outcome of MALE speeding violation;


Citation           95.73 %
Arrest Driver       2.62 %
N/D                 0.34 %
Arrest Passenger    0.20 %
No Action           0.04 %
Name: stop_outcome, dtype: float64

In [15]:
# As seen above, there is no large difference between male and female drivers in whether a speeding stop ends in a citation (95.73 % vs. 97.34 %)
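
The two-step filtering used above (first by gender, then by violation) can also be written as one combined Boolean mask. Below is a minimal sketch on made-up toy data (not `police.csv`); `toy`, `male_speeding_mask`, `male_and_speeding_toy`, and `outcome_share` are hypothetical names.

```python
import pandas as pd

# Toy data (made up for illustration; not taken from police.csv)
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'M', 'M', 'M'],
    'violation': ['Speeding', 'Equipment', 'Speeding', 'Speeding', 'Other'],
    'stop_outcome': ['Citation', 'Warning', 'Citation', 'Warning', 'Citation'],
})

# Combine two conditions with & ; each condition needs its own parentheses
male_speeding_mask = (toy['driver_gender'] == 'M') & (toy['violation'] == 'Speeding')
male_and_speeding_toy = toy[male_speeding_mask]

# Proportion of each outcome among male drivers stopped for speeding
outcome_share = male_and_speeding_toy['stop_outcome'].value_counts(normalize=True)
print(outcome_share)
```

The parentheses matter because `&` binds more tightly than `==` in Python.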

***

## Calculating the search rate

During a traffic stop, the police officer sometimes conducts a search of the vehicle. In this exercise, you'll calculate the percentage of all stops that result in a vehicle search, also known as the **search rate**.

**INSTRUCTIONS**

*   Check the data type of ``search_conducted`` to confirm that it's a ``Boolean Series``.
*   Calculate the search rate by counting the ``Series`` values and expressing them as proportions.
*   Calculate the search rate by taking the mean of the ``Series``. (It should match the proportion of ``True`` values calculated above.)

In [16]:
# Checking search_conducted column's data type
ri['search_conducted'].dtype

dtype('bool')

In [17]:
# Calculating search_conducted numbers
ri['search_conducted'].value_counts()

False    45998
True      2012
Name: search_conducted, dtype: int64

In [18]:
# Calculating search_conducted proportions
round(ri['search_conducted'].value_counts()/len(ri['search_conducted'])*100, 2)

False   95.81 %
True     4.19 %
Name: search_conducted, dtype: float64

In [19]:
# Calculating the search rate by taking the mean of the Series, checking it matches the True proportion value
ri['search_conducted'].mean()

0.04190793584669861
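
The reason the mean matches the proportion of `True` values is that pandas treats `True` as 1 and `False` as 0 in arithmetic. A small sketch with a made-up Boolean `Series` (the names `searched_toy`, `n_true`, `share_by_counting`, and `share_by_mean` are hypothetical):

```python
import pandas as pd

# Toy Boolean Series (made up for illustration): 2 searches out of 8 stops
searched_toy = pd.Series([True, False, False, False, True, False, False, False])

# Summing a Boolean Series counts the True values (True is 1, False is 0)
n_true = searched_toy.sum()

# Proportion of True values, computed by counting and dividing
share_by_counting = n_true / len(searched_toy)

# The same proportion, computed as the mean
share_by_mean = searched_toy.mean()

print(n_true, share_by_counting, share_by_mean)
```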

***

## Comparing search rates by gender

You'll compare the rates at which **female** and **male** drivers are searched during a traffic stop. Remember that the vehicle search rate across all stops is about **4.2%**.

First, you'll filter the ``DataFrame`` by gender and calculate the search rate for each group separately. Then, you'll perform the same calculation for both genders at once using a ``.groupby()``.

**INSTRUCTIONS 1/3**

*   Filter the ``DataFrame`` to only include **female** drivers, and then calculate the search rate by taking the mean of ``search_conducted``.

In [20]:
# Calculating search_conducted mean for female drivers
ri[ri['driver_gender']=='F']['search_conducted'].mean()

0.017807498685100308

**INSTRUCTIONS 2/3**

*   Filter the ``DataFrame`` to only include **male** drivers, and then repeat the search rate calculation.

In [21]:
# Calculating search_conducted mean for male drivers
ri[ri['driver_gender']=='M']['search_conducted'].mean()

0.05115126365234431

**INSTRUCTIONS 3/3**

*   Group by driver gender to calculate the search rate for both groups simultaneously. (It should match the previous results.)

In [22]:
# Grouping DataFrame by gender and finding search_conducted mean
(ri.groupby('driver_gender')[['search_conducted']].mean())*100

driver_gender,search_conducted
F,1.78 %
M,5.12 %


***

## Adding a second factor to the analysis

Even though the search rate for males is much higher than for females, it's possible that the difference is mostly due to a second factor.

For example, you might hypothesize that the search rate varies by violation type, and the difference in search rate between males and females is because they tend to commit different violations.

You can test this hypothesis by examining the search rate for each combination of gender and violation. If the hypothesis were true, you would find that males and females are searched at about the same rate for each violation. Find out below if that's the case!

**INSTRUCTIONS 1/2**

*   Use a ``.groupby()`` to calculate the search rate for each combination of gender and violation. Are males and females searched at about the same rate for each violation?

In [23]:
# Calculating the search rate (mean of search_conducted, in percent) for each violation within each gender
(ri.groupby(['driver_gender', 'violation'])[['search_conducted']].mean())*100

driver_gender,violation,search_conducted
F,Equipment,7.91 %
F,Moving violation,4.78 %
F,Other,4.52 %
F,Registration/plates,11.44 %
F,Speeding,0.69 %
M,Equipment,12.34 %
M,Moving violation,8.88 %
M,Other,15.50 %
M,Registration/plates,17.15 %
M,Speeding,2.86 %


**INSTRUCTIONS 2/2**

*   Reverse the ordering to group by violation before gender. The results may be easier to compare when presented this way.

In [24]:
# Calculating the search rate (mean of search_conducted, in percent) for each gender within each violation
(ri.groupby(['violation', 'driver_gender'])[['search_conducted']].mean())*100

violation,driver_gender,search_conducted
Equipment,F,7.91 %
Equipment,M,12.34 %
Moving violation,F,4.78 %
Moving violation,M,8.88 %
Other,F,4.52 %
Other,M,15.50 %
Registration/plates,F,11.44 %
Registration/plates,M,17.15 %
Speeding,F,0.69 %
Speeding,M,2.86 %
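
One further way to make the comparison easier to read is to pivot the inner group level into columns with `.unstack()`, so each violation gets one row with the genders side by side. This is a sketch on made-up toy data (not `police.csv`); `toy`, `rates`, and `side_by_side` are hypothetical names.

```python
import pandas as pd

# Toy data (made up for illustration; not taken from police.csv)
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Equipment',
                  'Speeding', 'Speeding', 'Equipment', 'Equipment'],
    'search_conducted': [False, False, True, False, True, False, True, True],
})

# Search rate for each (violation, gender) pair, as a Series with a two-level index
rates = toy.groupby(['violation', 'driver_gender'])['search_conducted'].mean()

# Move the inner level (driver_gender) into columns: one row per violation
side_by_side = rates.unstack()
print(side_by_side)
```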


***

## Counting protective frisks

During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a ``"protective frisk."``

You'll first check to see how many times "Protective Frisk" was the only search type. Then, you'll use a string method to locate all instances in which the driver was frisked.

**INSTRUCTIONS**

*   Count the ``search_type`` values to see how many times ``"Protective Frisk"`` was the only search type.
*   Create a new column, frisk, that is ``True`` if ``search_type`` contains the string ``"Protective Frisk"`` and ``False`` otherwise.
*   Check the data type of frisk to confirm that it's a ``Boolean Series``.
*   Take the sum of frisk to count the total number of frisks.

In [25]:
# Counting the Protective Frisk number in search_type
ri['search_type'].value_counts()

Incident to Arrest                                          958
Probable Cause                                              244
Protective Frisk                                            204
Inventory                                                   117
Incident to Arrest,Inventory                                116
Incident to Arrest,Probable Cause                            76
Incident to Arrest,Protective Frisk                          63
Reasonable Suspicion                                         43
Probable Cause,Protective Frisk                              36
Incident to Arrest,Inventory,Protective Frisk                33
Inventory,Protective Frisk                                   23
Incident to Arrest,Probable Cause,Protective Frisk           20
Incident to Arrest,Inventory,Probable Cause                  19
Inventory,Probable Cause                                     16
Protective Frisk,Reasonable Suspicion                        16
Probable Cause,Reasonable Suspicion     

In [26]:
# Creating a new column named 'frisk' that is True where search_type contains 'Protective Frisk' (missing values count as False)
ri['frisk']=ri['search_type'].str.contains(pat='Protective Frisk', na=False)

In [27]:
ri['frisk']

stop_datetime
2005-01-02 01:55:00    False
2005-01-02 20:30:00    False
2005-01-04 12:55:00    False
2005-01-06 01:30:00    False
2005-01-12 08:05:00    False
                       ...  
2006-08-08 22:45:00    False
2006-08-08 22:45:00    False
2006-08-08 22:53:00    False
2006-08-08 23:00:00    False
2006-08-08 23:00:00    False
Name: frisk, Length: 48010, dtype: bool

In [28]:
# Checking the data type of frisk column
ri['frisk'].dtype

dtype('bool')

In [29]:
# Taking sum of frisk column entries
ri['frisk'].sum()

403
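
The total is larger than the count of rows whose only search type is `"Protective Frisk"` because `.str.contains()` also matches combined entries. The difference between an exact comparison and a substring match can be sketched on a made-up `Series` (the names `search_type_toy`, `exact_match`, and `contains_frisk` are hypothetical):

```python
import pandas as pd

# Toy search_type values (made up for illustration; not taken from police.csv)
search_type_toy = pd.Series([
    'Protective Frisk',
    'Incident to Arrest,Protective Frisk',
    'Probable Cause',
    None,  # missing value: no search was conducted
])

# Exact comparison: True only when the whole string is 'Protective Frisk'
exact_match = search_type_toy == 'Protective Frisk'

# Substring match: True whenever 'Protective Frisk' appears anywhere in the string;
# na=False turns missing values into False instead of NaN, so the result is Boolean
contains_frisk = search_type_toy.str.contains('Protective Frisk', na=False)

print(exact_match.sum(), contains_frisk.sum())
```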

***

## Comparing frisk rates by gender

You'll compare the rates at which female and male drivers are frisked during a search. Are males frisked more often than females, perhaps because police officers consider them to be higher risk?

Before doing any calculations, it's important to filter the ``DataFrame`` to only include the relevant subset of data, namely stops in which a search was conducted.

**INSTRUCTIONS**

*   Create a ``DataFrame``, searched, that only contains rows in which ``search_conducted`` is ``True``.
*   Take the mean of the frisk column to find out what percentage of searches included a frisk.
*   Calculate the frisk rate for each gender using a ``.groupby()``.

In [30]:
# Creating searched DataFrame containing only the stops where search_conducted is True
searched = ri[ri['search_conducted']]

In [31]:
# Calculating the shape of searched DataFrame
searched.shape

(2012, 20)

In [32]:
# Calculating the mean of frisk in searched DataFrame
searched['frisk'].mean()

0.20029821073558648

In [33]:
# First five rows of searched DataFrame
searched.head()

stop_datetime,id,location_raw,police_department,driver_gender,driver_age_raw,driver_age,driver_race_raw,driver_race,violation_raw,violation,search_conducted,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district,frisk
2005-01-24 20:32:00,RI-2005-00010,Zone K1,600,M,"1,987.00 %",18.00 %,W,White,Speeding,Speeding,True,Probable Cause,True,Citation,False,0-15 Min,True,True,Zone K1,False
2005-02-09 03:05:00,RI-2005-00011,Zone X4,500,M,"1,976.00 %",29.00 %,W,White,Registration Violation,Registration/plates,True,"Probable Cause,Protective Frisk",False,Citation,False,0-15 Min,False,False,Zone X4,True
2005-08-28 01:00:00,RI-2005-00084,Zone X1,0,M,"1,979.00 %",26.00 %,W,White,Other Traffic Violation,Moving violation,True,"Incident to Arrest,Protective Frisk",False,Arrest Driver,True,16-30 Min,True,False,Zone X1,True
2005-09-15 02:20:00,RI-2005-00094,Zone X4,500,M,"1,988.00 %",17.00 %,W,White,Other Traffic Violation,Moving violation,True,Incident to Arrest,False,Arrest Driver,True,16-30 Min,False,False,Zone X4,False
2005-09-24 02:20:00,RI-2005-00115,Zone K3,300,M,"1,987.00 %",18.00 %,W,White,Other Traffic Violation,Moving violation,True,Incident to Arrest,False,Arrest Driver,True,16-30 Min,False,False,Zone K3,False


In [34]:
# Calculating the frisk rate for gender
searched.groupby('driver_gender')[['frisk']].mean()

driver_gender,frisk
F,0.16 %
M,0.21 %


In [35]:
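# Among searched drivers, about 21% of males and 16% of females were frisked, a gap of roughly 5 percentage points
# Note: the values above are proportions (0.16 and 0.21); the '%' suffix comes from the global float_format set earlier, not from multiplying by 100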
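
To see why filtering to searched stops first matters, the sketch below compares the frisk rate over all stops with the frisk rate over searches only. It uses made-up toy data (not `police.csv`), and `toy`, `rate_all_stops`, `searched_toy`, and `rate_searches` are hypothetical names.

```python
import pandas as pd

# Toy data (made up for illustration; not taken from police.csv)
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'search_conducted': [True, True, False, False, True, True, True, False],
    'frisk': [True, False, False, False, True, True, False, False],
})

# Frisk rate over ALL stops: the denominator includes stops with no search at all
rate_all_stops = toy.groupby('driver_gender')['frisk'].mean()

# Frisk rate over searches only: filter first, then group
searched_toy = toy[toy['search_conducted']]
rate_searches = searched_toy.groupby('driver_gender')['frisk'].mean()

print(rate_all_stops)
print(rate_searches)
```

The two results differ because the denominators differ: the first answers "what share of stops involved a frisk?", the second "what share of searches involved a frisk?", which is the question posed in this section.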