# 2. Exploring the relationship between gender and policing
**Does the gender of a driver have an impact on police behavior during a traffic stop? In this chapter, you will explore that question while practicing filtering, grouping, method chaining, Boolean math, string methods, and more!**

In [3]:
import pandas as pd
ri = pd.read_csv('police.csv')

ri.drop(['county_name', 'state'], axis='columns', inplace=True)
ri.dropna(subset=['driver_gender'], inplace=True)

ri['is_arrested'] = ri.is_arrested.astype('bool')

combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')
ri['stop_datetime'] = pd.to_datetime(combined)

ri.set_index('stop_datetime', inplace=True)

## Do the genders commit different violations?
In this chapter, you'll use the dataset to explore the relationship between gender and policing, and you'll practice figuring out how to use pandas to answer specific questions.
 
### Counting unique values (1)
Let's start by discussing a few methods that will help with the analysis. 

The first method is `value_counts()`, which counts the unique values in a Series. It's best suited for a column that contains categorical rather than numerical data. 

For example, we can apply `value_counts()` to the `stop_outcome` column, which contains the outcome of each traffic stop.

In [4]:
ri.stop_outcome.value_counts()

Citation            77091
Warning              5136
Arrest Driver        2735
No Action             624
N/D                   607
Arrest Passenger      343
Name: stop_outcome, dtype: int64

The results are displayed in descending order, so you can see that the most common outcome is a citation, also known as a ticket, and the second most common outcome is a warning.

### Counting unique values (2)
Because `value_counts()` outputs a pandas Series, you can take the sum of this Series by simply adding the `sum()` method on the end. This is known as method chaining, a powerful technique.

In [5]:
ri.stop_outcome.value_counts().sum()

86536

The `sum()` of the `value_counts()` is actually equal to the number of rows in the DataFrame, which will be the case for any Series that has no missing values.

In [6]:
ri.shape

(86536, 13)
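The caveat about missing values can be checked on a small example. This is a minimal sketch that uses a made-up Series (not the real `police.csv` data) with one missing value, to show that `value_counts().sum()` falls short of the row count by exactly the number of missing values.

```python
import pandas as pd

# Made-up outcomes for illustration only; None marks a missing value
outcomes = pd.Series(['Citation', 'Warning', None, 'Citation', 'Arrest Driver'])

counted = outcomes.value_counts().sum()  # value_counts() skips missing values by default
total_rows = len(outcomes)
n_missing = outcomes.isnull().sum()

print(counted, total_rows, n_missing)  # 4 5 1
```

Because `stop_outcome` has no missing values in `ri`, the two numbers agree there.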

### Expressing counts as proportions
Rather than examining the raw counts, you might prefer to see the stop outcomes as proportions of the total. So if you wanted to know what percentage of traffic stops ended in a citation, you would divide the number of citations by the total number of outcomes and get 0.89, or 89%.

In [8]:
ri.stop_outcome.value_counts()

Citation            77091
Warning              5136
Arrest Driver        2735
No Action             624
N/D                   607
Arrest Passenger      343
Name: stop_outcome, dtype: int64

In [9]:
77091/86536

0.8908546731995932

Rather than doing these calculations manually, you can instead set the `normalize` parameter of `value_counts()` to be `True`, and it will output proportions instead of counts. 

In [10]:
ri.stop_outcome.value_counts(normalize=True)

Citation            0.890855
Warning             0.059351
Arrest Driver       0.031605
No Action           0.007211
N/D                 0.007014
Arrest Passenger    0.003964
Name: stop_outcome, dtype: float64

Citations are 89%, warnings are 6%, driver arrests are 3%, and so on.
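To see that `normalize=True` is just the manual division done for every category at once, here is a minimal sketch on a made-up Series of ten outcomes (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Ten made-up outcomes: 8 citations, 1 warning, 1 arrest
outcomes = pd.Series(['Citation'] * 8 + ['Warning'] + ['Arrest Driver'])

counts = outcomes.value_counts()
manual = counts / counts.sum()                      # divide each count by the total
built_in = outcomes.value_counts(normalize=True)    # same proportions in one step

print(manual['Citation'], built_in['Citation'])     # 0.8 0.8
```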

### Filtering DataFrame rows
Let's now take a look at the `value_counts()` for a different column, `driver_race`. You can see that there are five unique categories present. 

In [11]:
ri.driver_race.value_counts()

White       61870
Black       12285
Hispanic     9727
Asian        2389
Other         265
Name: driver_race, dtype: int64

If you wanted to filter the DataFrame to only include drivers of a particular race, such as White, you would write that as a condition and put it inside brackets. We'll save the result in a new object. 

In [13]:
white = ri[ri.driver_race == 'White']
white.shape

(61870, 13)

The shape of the new DataFrame is 61,870 rows, because that's the number of White drivers in the dataset, and 13 columns. You can now analyze this smaller DataFrame separately.
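It can help to look at the condition on its own before putting it inside the brackets. The sketch below uses a small made-up DataFrame (the rows are invented for illustration) to show that the condition is a Boolean Series, and that the number of `True` values equals the number of rows kept by the filter.

```python
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    'driver_race': ['White', 'Black', 'White', 'Asian', 'White'],
    'violation': ['Speeding', 'Equipment', 'Speeding', 'Other', 'Seat belt'],
})

mask = df.driver_race == 'White'  # Boolean Series with one value per row
white = df[mask]                  # keeps only the rows where mask is True

print(mask.tolist())  # [True, False, True, False, True]
print(white.shape)    # (3, 2)
```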

### Comparing stop outcomes for two groups
For example, you could repeat the analysis of stop outcomes, but focus on White drivers only. Like before, you select the `stop_outcome` column and then chain the `value_counts()` method on the end. 

In [14]:
white.stop_outcome.value_counts(normalize=True)

Citation            0.902263
Arrest Driver       0.024018
No Action           0.007031
N/D                 0.006433
Arrest Passenger    0.002748
Name: stop_outcome, dtype: float64

You could compare these results with the outcomes for another race, such as Asian, simply by changing the condition inside the brackets and then repeating the calculation. 

In [15]:
asian = ri[ri.driver_race == 'Asian']
asian.stop_outcome.value_counts(normalize=True)

Citation            0.922980
Arrest Driver       0.017581
No Action           0.008372
N/D                 0.004186
Arrest Passenger    0.001674
Name: stop_outcome, dtype: float64

If you compare these two sets of numbers, you can see that the stop outcomes are fairly similar for these two groups.

You can repeat the same analysis for the other two race groups, Black and Hispanic:

In [16]:
black = ri[ri.driver_race == 'Black']
black.stop_outcome.value_counts(normalize=True)

Citation            0.857224
Arrest Driver       0.054294
N/D                 0.008547
Arrest Passenger    0.008303
No Action           0.006512
Name: stop_outcome, dtype: float64

In [17]:
hispanic = ri[ri.driver_race == 'Hispanic']
hispanic.stop_outcome.value_counts(normalize=True)

Citation            0.852061
Arrest Driver       0.055310
N/D                 0.009458
No Action           0.008841
Arrest Passenger    0.006888
Name: stop_outcome, dtype: float64

The stop outcomes are also similar to each other for these two groups.
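Filtering once per group works, but it gets repetitive. One possible shortcut, which is not used in the analysis above, is `pd.crosstab()` with `normalize='index'`, which computes the outcome proportions within each group in a single table. This is a minimal sketch on made-up data.

```python
import pandas as pd

# Made-up stops for illustration only
df = pd.DataFrame({
    'driver_race': ['White', 'White', 'White', 'White', 'Asian', 'Asian'],
    'stop_outcome': ['Citation', 'Citation', 'Citation', 'Warning', 'Citation', 'Warning'],
})

# Each row of the table sums to 1: proportions of outcomes within one race group
table = pd.crosstab(df.driver_race, df.stop_outcome, normalize='index')
print(table)

# The filter-then-count approach gives the same number for a single group
by_filter = df[df.driver_race == 'White'].stop_outcome.value_counts(normalize=True)
```

Having all groups in one table makes it easier to compare a single outcome, such as Citation, down a column.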

## Examining traffic violations
Before comparing the violations being committed by each gender, you should examine the violations committed by all drivers to get a baseline understanding of the data.

In this exercise, you'll count the unique values in the `violation` column, and then separately express those counts as proportions.

- Count the unique values in the `violation` column of the `ri` DataFrame, to see what violations are being committed by all drivers.
- Express the violation counts as proportions of the total.


In [19]:
# Count the unique values in 'violation'
print(ri.violation.value_counts())

Speeding               48423
Moving violation       16224
Equipment              10921
Other                   4409
Registration/plates     3703
Seat belt               2856
Name: violation, dtype: int64


In [20]:
# Express the counts as proportions
print(ri.violation.value_counts(normalize=True))

Speeding               0.559571
Moving violation       0.187483
Equipment              0.126202
Other                  0.050950
Registration/plates    0.042791
Seat belt              0.033004
Name: violation, dtype: float64


*More than half of all violations are for speeding, followed by other moving violations and equipment violations.*

## Comparing violations by gender
The question we're trying to answer is whether male and female drivers tend to commit different types of traffic violations.

In this exercise, you'll first create a DataFrame for each gender, and then analyze the violations in each DataFrame separately.

- Create a DataFrame, `female`, that only contains rows in which `driver_gender` is `'F'`.
- Create a DataFrame, `male`, that only contains rows in which `driver_gender` is `'M'`.
- Count the violations committed by female drivers and express them as proportions.
- Count the violations committed by male drivers and express them as proportions.

In [21]:
# Create a DataFrame of female drivers
female = ri[ri.driver_gender == 'F']

# Create a DataFrame of male drivers
male = ri[ri.driver_gender == 'M']

# Compute the violations by female drivers (as proportions)
print(female.violation.value_counts(normalize=True))

Speeding               0.658114
Moving violation       0.138218
Equipment              0.105199
Registration/plates    0.044418
Other                  0.029738
Seat belt              0.024312
Name: violation, dtype: float64


In [22]:
# Compute the violations by male drivers (as proportions)
print(male.violation.value_counts(normalize=True))

Speeding               0.522243
Moving violation       0.206144
Equipment              0.134158
Other                  0.058985
Registration/plates    0.042175
Seat belt              0.036296
Name: violation, dtype: float64


*About two-thirds of female traffic stops are for speeding, whereas stops of males are more balanced among the six categories.*

*This doesn't mean that females speed more often than males, however, since we didn't take into account the number of stops or drivers.*

---
## Does gender affect who gets a ticket for speeding?
In this section, we'll narrow our focus to the relationship between gender and stop outcomes for one specific violation, namely speeding.
 
### Filtering by multiple conditions (1)
We'll need to use one additional technique for this analysis, namely filtering a DataFrame by multiple conditions. 

In the last exercise, you used a single condition, `driver_gender == 'F'`, to create a DataFrame of female drivers. It has 23,774 rows because that's the number of rows in the `ri` DataFrame that satisfy this condition.

In [23]:
female.shape

(23774, 13)

### Filtering by multiple conditions (2)
What if we wanted to create a second DataFrame of female drivers, but only those who were arrested? We simply add a second condition to the filter, namely that the `is_arrested` column equals True. Notice that each condition is surrounded by parentheses, and there is an `&` between the two conditions, which represents the logical `AND` operator.

In [26]:
female_and_arrested = ri[(ri.driver_gender == 'F') & (ri.is_arrested == True)]
female_and_arrested.shape

(669, 13)

The second DataFrame is much smaller because it only includes rows that satisfy both conditions, meaning that it only includes female drivers who were also arrested.

### Filtering by multiple conditions (3)
When filtering a DataFrame by multiple conditions, another option is to use the vertical pipe character between the two conditions. The `|` represents the logical `OR` operator, which indicates that a row should be included in the DataFrame if it meets either condition. 

In [27]:
female_or_arrested = ri[(ri.driver_gender == 'F') | (ri.is_arrested == True)]
female_or_arrested.shape

(26183, 13)

This DataFrame is larger than the last one because it includes all females regardless of whether they were arrested, as well as all drivers who were arrested, regardless of whether they are female.
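The sizes of the `&` and `|` results are linked: the number of rows meeting either condition equals the rows meeting the first, plus the rows meeting the second, minus the rows meeting both (which would otherwise be counted twice). The sketch below checks this on a small made-up DataFrame.

```python
import pandas as pd

# Made-up stops for illustration only
df = pd.DataFrame({
    'driver_gender': ['F', 'F', 'M', 'M', 'F', 'M'],
    'is_arrested':   [True, False, True, False, False, False],
})

is_female = df.driver_gender == 'F'
arrested = df.is_arrested == True

n_female = is_female.sum()             # 3
n_arrested = arrested.sum()            # 2
n_and = (is_female & arrested).sum()   # 1
n_or = (is_female | arrested).sum()    # 4

print(n_or == n_female + n_arrested - n_and)  # True
```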

### Rules for filtering by multiple conditions
Here's a quick summary of the rules for filtering DataFrames by multiple conditions. 

- Use the `&` to only include rows that satisfy both conditions.
- Use the `|` to include rows that satisfy either condition. 
- Each condition must be surrounded by parentheses. 
- Conditions can check for equality (`==`), inequality (`!=`), greater than, less than, and so on. And you can use more than two conditions to create a filter.
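The rules above can be sketched with a three-condition filter on a small made-up DataFrame. The `driver_age` column is a hypothetical addition for this example; it does not appear among the columns of `ri` used in this chapter.

```python
import pandas as pd

# Made-up stops; driver_age is a hypothetical column used only for illustration
df = pd.DataFrame({
    'driver_gender': ['F', 'M', 'F', 'M'],
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Other'],
    'driver_age': [25, 40, 33, 19],
})

# Each comparison is wrapped in parentheses because & binds more tightly
# than ==, != and > in Python; without them the expression is parsed differently.
subset = df[(df.driver_gender == 'M')
            & (df.violation != 'Equipment')
            & (df.driver_age > 21)]

print(subset.index.tolist())  # [1]
```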

### Correlation, not causation
In the next exercises, you'll analyze the relationship between gender and stop outcome when a driver is pulled over for speeding. In other words, you're examining the data to assess whether there is a correlation between these two attributes. However, it's important to note that we're not going to draw any conclusions about causation during this course, since we don't have the data or the expertise required to do so. Instead, we're simply exploring the relationships between different attributes in the dataset.

## Comparing speeding outcomes by gender
When a driver is pulled over for speeding, many people believe that gender has an impact on whether the driver will receive a ticket or a warning. Can you find evidence of this in the dataset?

First, you'll create two DataFrames of drivers who were stopped for speeding: one containing females and the other containing males.

Then, for each gender, you'll use the `stop_outcome` column to calculate what percentage of stops resulted in a "Citation" (meaning a ticket) versus a "Warning".

- Create a DataFrame, `female_and_speeding`, that only includes female drivers who were stopped for speeding.
- Create a DataFrame, `male_and_speeding`, that only includes male drivers who were stopped for speeding.
- Count the stop outcomes for the female drivers and express them as proportions.
- Count the stop outcomes for the male drivers and express them as proportions.

In [28]:
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri.driver_gender == 'F') & (ri.violation == 'Speeding')]

# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri.driver_gender == 'M') & (ri.violation == 'Speeding')]

# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding.stop_outcome.value_counts(normalize=True))

Citation            0.952192
Arrest Driver       0.005752
N/D                 0.000959
Arrest Passenger    0.000639
No Action           0.000383
Name: stop_outcome, dtype: float64


In [29]:
# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding.stop_outcome.value_counts(normalize=True))

Citation            0.944595
Arrest Driver       0.015895
Arrest Passenger    0.001281
No Action           0.001068
N/D                 0.000976
Name: stop_outcome, dtype: float64


*The numbers are similar for males and females: about 95% of stops for speeding result in a ticket. Thus, the data fails to show that gender has an impact on who gets a ticket for speeding.*

---
## Does gender affect whose vehicle is searched?
During a traffic stop, the police officer sometimes conducts a search of the vehicle. Does the driver's gender affect whether their vehicle is searched? Let's review a few pandas techniques that will help us to answer this question.

### Math with Boolean values
We previously used the `isnull()` method to generate a DataFrame of `True` and `False` values, and then took the `sum()` to count the missing values in each column. 

In [30]:
ri.isnull().sum()

stop_date                 0
stop_time                 0
driver_gender             0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           83229
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
dtype: int64

This worked because True values were treated as ones and False values were treated as zeros. 

Now we'll use the NumPy library to demonstrate a different operation, namely the mean. If you take the `mean()` of the list `0 1 0 0` you'll get 0.25, calculated as 1 divided by 4. 

In [31]:
import numpy as np
np.mean([0, 1, 0, 0])

0.25

Similarly, if you take the `mean()` of the list `False True False False`, you'll also get 0.25. 

In [32]:
np.mean([False, True, False, False])

0.25

Thus, the mean of a Boolean Series represents the proportion of values that are `True` (here 0.25, or 25%).
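The same arithmetic carries over from NumPy lists to a pandas Series. In this minimal sketch on made-up values, `sum()` counts the `True` values and `mean()` divides that count by the length.

```python
import pandas as pd

flags = pd.Series([False, True, False, False])  # made-up Boolean values

n_true = flags.sum()       # True counts as 1, False as 0
share_true = flags.mean()  # n_true divided by the number of values

print(n_true, share_true)  # 1 0.25
```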

### Taking the mean of a Boolean Series
Now, let's see a real example of why it's useful to be able to take the mean of a Boolean Series. 

First, calculate the percentage of stops that result in an arrest using the `value_counts()` method. 

In [34]:
ri.is_arrested.value_counts(normalize=True)

False    0.964431
True     0.035569
Name: is_arrested, dtype: float64

The arrest rate is around 3.6% since that's the percentage of True values. Note that this would work on an object column or a Boolean column. But we can get the same result more easily by taking the `mean()` of the `is_arrested` Series. 

In [35]:
ri.is_arrested.mean()

0.0355690117407784

This method only works because the data type is Boolean. 

In [36]:
ri.is_arrested.dtype

dtype('bool')

This is exactly why the data type of this Series was changed from object to Boolean before.

### Comparing groups using groupby (1)
The second technique we'll review is `groupby()`. 

Let's pretend that you wanted to study the arrest rate by police district. You can see that there are six districts by using the Series method `unique()`. 

In [37]:
ri.district.unique()

array(['Zone X4', 'Zone K3', 'Zone X1', 'Zone X3', 'Zone K1', 'Zone K2'],
      dtype=object)

One approach we've used to compare groups is to filter the DataFrame by each group, and then perform a calculation on each subset. So to calculate the arrest rate in `Zone K1`, we would filter by that district, select the `is_arrested` column, and then take the `mean()`. 

In [38]:
ri[ri.district == 'Zone K1'].is_arrested.mean()

0.024349083895853423

The arrest rate is about 2.4%, which is lower than the overall arrest rate of 3.6%.

### Comparing groups using groupby (2)
Next we calculate the arrest rate in `Zone K2`, 

In [39]:
ri[ri.district == 'Zone K2'].is_arrested.mean()

0.030800588834786546

which is about 3.1%. 

But rather than repeating this process for all six districts, we can instead group by the district column, which will perform the same calculation for all districts at once. 

In [40]:
ri.groupby('district').is_arrested.mean()

district
Zone K1    0.024349
Zone K2    0.030801
Zone K3    0.032311
Zone X1    0.023494
Zone X3    0.034871
Zone X4    0.048038
Name: is_arrested, dtype: float64

You can see a noticeably higher arrest rate in `Zone X4`.
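To confirm that `groupby()` performs the same calculation as filtering district by district, here is a minimal sketch on a made-up DataFrame with two districts.

```python
import pandas as pd

# Made-up stops for illustration only
df = pd.DataFrame({
    'district': ['Zone K1', 'Zone K1', 'Zone X4', 'Zone X4', 'Zone X4', 'Zone K1', 'Zone K1'],
    'is_arrested': [False, True, True, False, True, False, False],
})

# One filter per district, collected in a dict
by_filter = {d: df[df.district == d].is_arrested.mean() for d in df.district.unique()}

# The same arrest rates computed for all districts at once
by_group = df.groupby('district').is_arrested.mean()

print(by_group)
```

Both approaches give 0.25 for `Zone K1` (1 arrest in 4 stops) and about 0.667 for `Zone X4` (2 arrests in 3 stops).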

### Grouping by multiple categories
You can also group by multiple categories at once. For example, you can group by district and gender by passing the column names as a list of strings. 

In [41]:
ri.groupby(['district', 'driver_gender']).is_arrested.mean()

district  driver_gender
Zone K1   F                0.019169
          M                0.026588
Zone K2   F                0.022196
          M                0.034285
Zone K3   F                0.025156
          M                0.034961
Zone X1   F                0.019646
          M                0.024563
Zone X3   F                0.027188
          M                0.038166
Zone X4   F                0.042149
          M                0.049956
Name: is_arrested, dtype: float64

This computes the arrest rate for every combination of district and gender. In other words, you can see the arrest rate for males and females in each district separately. 

Note that if you reverse the ordering of the items in the list, grouping first by gender and then by district, the calculations will be the same but the presentation of the results will be different.

In [42]:
ri.groupby(['driver_gender', 'district']).is_arrested.mean()

driver_gender  district
F              Zone K1     0.019169
               Zone K2     0.022196
               Zone K3     0.025156
               Zone X1     0.019646
               Zone X3     0.027188
               Zone X4     0.042149
M              Zone K1     0.026588
               Zone K2     0.034285
               Zone K3     0.034961
               Zone X1     0.024563
               Zone X3     0.038166
               Zone X4     0.049956
Name: is_arrested, dtype: float64

You can use whichever option makes it easier for you to understand the results.
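The claim that the order of the grouping columns changes only the presentation can be checked directly. This sketch, on made-up data, groups both ways and compares the rate for every combination of district and gender.

```python
import pandas as pd

# Made-up stops for illustration only
df = pd.DataFrame({
    'district': ['Zone K1', 'Zone K1', 'Zone K1', 'Zone X4', 'Zone X4', 'Zone X4'],
    'driver_gender': ['F', 'M', 'M', 'F', 'F', 'M'],
    'is_arrested': [False, True, False, True, False, True],
})

a = df.groupby(['district', 'driver_gender']).is_arrested.mean()
b = df.groupby(['driver_gender', 'district']).is_arrested.mean()

# Look up each (district, gender) pair in a and the swapped pair in b
same = all(a[(d, g)] == b[(g, d)] for d, g in a.index)
print(same)  # True
```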

## Calculating the search rate
During a traffic stop, the police officer sometimes conducts a search of the vehicle. In this exercise, you'll calculate the percentage of all stops in the `ri` DataFrame that result in a vehicle search, also known as the search rate.

- Check the data type of `search_conducted` to confirm that it's a Boolean Series.
- Calculate the search rate by counting the Series values and expressing them as proportions.
- Calculate the search rate by taking the mean of the Series. (It should match the proportion of `True` values calculated above.)

In [44]:
# Check the data type of 'search_conducted'
print(ri.search_conducted.dtype)

bool


In [45]:
# Calculate the search rate by counting the values
print(ri.search_conducted.value_counts(normalize=True))

False    0.961785
True     0.038215
Name: search_conducted, dtype: float64


In [46]:
# Calculate the search rate by taking the mean
print(ri.search_conducted.mean())

0.0382153092354627


*It looks like the search rate is about 3.8%.*

## Comparing search rates by gender
In this exercise, you'll compare the rates at which female and male drivers are searched during a traffic stop. Remember that the vehicle search rate across all stops is about 3.8%.

First, you'll filter the DataFrame by gender and calculate the search rate for each group separately. Then, you'll perform the same calculation for both genders at once using a `.groupby()`.

- Filter the DataFrame to only include female drivers, and then calculate the search rate by taking the mean of `search_conducted`.

In [47]:
# Calculate the search rate for female drivers
print(ri[ri.driver_gender == 'F'].search_conducted.mean())

0.019180617481282074


- Filter the DataFrame to only include male drivers, and then repeat the search rate calculation.

In [48]:
# Calculate the search rate for male drivers
print(ri[ri.driver_gender == 'M'].search_conducted.mean())

0.04542557598546892


- Group by driver gender to calculate the search rate for both groups simultaneously. (It should match the previous results.)

In [49]:
# Calculate the search rate for both groups simultaneously
print(ri.groupby('driver_gender').search_conducted.mean())

driver_gender
F    0.019181
M    0.045426
Name: search_conducted, dtype: float64


*Male drivers are searched more than twice as often as female drivers.*

*Why might this be?*

## Adding a second factor to the analysis
Even though the search rate for males is much higher than for females, it's possible that the difference is mostly due to a second factor.

For example, you might hypothesize that the search rate varies by violation type, and the difference in search rate between males and females is because they tend to commit different violations.

You can test this hypothesis by examining the search rate for each combination of gender and violation. If the hypothesis were true, you would find that males and females are searched at about the same rate for each violation. Find out below if that's the case!

- Use a `.groupby()` to calculate the search rate for each combination of gender and violation. Are males and females searched at about the same rate for each violation?

In [50]:
# Calculate the search rate for each combination of gender and violation
print(ri.groupby(['driver_gender', 'violation']).search_conducted.mean())

driver_gender  violation          
F              Equipment              0.039984
               Moving violation       0.039257
               Other                  0.041018
               Registration/plates    0.054924
               Seat belt              0.017301
               Speeding               0.008309
M              Equipment              0.071496
               Moving violation       0.061524
               Other                  0.046191
               Registration/plates    0.108802
               Seat belt              0.035119
               Speeding               0.027885
Name: search_conducted, dtype: float64


- Reverse the ordering to group by violation before gender. The results may be easier to compare when presented this way.

In [51]:
# Reverse the ordering to group by violation before gender
print(ri.groupby(['violation', 'driver_gender']).search_conducted.mean())

violation            driver_gender
Equipment            F                0.039984
                     M                0.071496
Moving violation     F                0.039257
                     M                0.061524
Other                F                0.041018
                     M                0.046191
Registration/plates  F                0.054924
                     M                0.108802
Seat belt            F                0.017301
                     M                0.035119
Speeding             F                0.008309
                     M                0.027885
Name: search_conducted, dtype: float64


*For all types of violations, the search rate is higher for males than for females, which contradicts our hypothesis that the difference is explained by violation type.*
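One way to make this comparison easier to read is to pivot the inner index level into columns with `unstack()`, so that the female and male rates sit side by side. This is an optional extra step, not part of the exercise, shown here as a minimal sketch on made-up data.

```python
import pandas as pd

# Made-up stops for illustration only
df = pd.DataFrame({
    'violation': ['Speeding'] * 4 + ['Equipment'] * 4,
    'driver_gender': ['F', 'F', 'M', 'M', 'F', 'F', 'M', 'M'],
    'search_conducted': [False, False, True, False, False, True, True, True],
})

# Rows are violations, columns are genders
rates = df.groupby(['violation', 'driver_gender']).search_conducted.mean().unstack()

# A positive difference means males were searched at a higher rate
rates['M_minus_F'] = rates['M'] - rates['F']
print(rates)
```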

---
## Does gender affect who is frisked during a search?
In this section, we'll take a look at what happens during a search.

### Examining the search types
The `search_conducted` field is True if there's a search during a traffic stop, and False otherwise. 

In [53]:
ri.search_conducted.value_counts()

False    83229
True      3307
Name: search_conducted, dtype: int64

There's also a related field, `search_type`, that contains additional information about the search. 

In [55]:
ri.search_type.value_counts(dropna=False)

NaN                                                         83229
Incident to Arrest                                           1290
Probable Cause                                                924
Inventory                                                     219
Reasonable Suspicion                                          214
Protective Frisk                                              164
Incident to Arrest,Inventory                                  123
Incident to Arrest,Probable Cause                             100
Probable Cause,Reasonable Suspicion                            54
Incident to Arrest,Inventory,Probable Cause                    35
Probable Cause,Protective Frisk                                35
Incident to Arrest,Protective Frisk                            33
Inventory,Probable Cause                                       25
Protective Frisk,Reasonable Suspicion                          19
Incident to Arrest,Inventory,Protective Frisk                  18
Incident t

Notice that the `search_type` field has 83,229 missing values, which is identical to the number of False values in the `search_conducted` field. That's because any time a search is not conducted, there's no information to record about a search, and thus the `search_type` will be missing. Note that the `value_counts()` method excludes missing values by default, and so we specified `dropna=False` in order to see the missing values.
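The relationship between the two fields, and the effect of `dropna=False`, can be sketched on a small made-up DataFrame in which `search_type` is missing exactly when no search was conducted.

```python
import pandas as pd

# Made-up stops for illustration only
df = pd.DataFrame({
    'search_conducted': [False, True, False, True, False],
    'search_type': [None, 'Inventory', None, 'Probable Cause,Inventory', None],
})

n_missing = df.search_type.isnull().sum()      # 3 missing search types
n_not_searched = (~df.search_conducted).sum()  # 3 stops without a search

with_na = df.search_type.value_counts(dropna=False)  # includes a row for missing values
without_na = df.search_type.value_counts()           # default: missing values excluded

print(with_na.sum(), without_na.sum())  # 5 2
```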

### Examining the search types
There are only five possible values for `search_type`, 

In [56]:
ri.search_type.value_counts()

Incident to Arrest                                          1290
Probable Cause                                               924
Inventory                                                    219
Reasonable Suspicion                                         214
Protective Frisk                                             164
Incident to Arrest,Inventory                                 123
Incident to Arrest,Probable Cause                            100
Probable Cause,Reasonable Suspicion                           54
Probable Cause,Protective Frisk                               35
Incident to Arrest,Inventory,Probable Cause                   35
Incident to Arrest,Protective Frisk                           33
Inventory,Probable Cause                                      25
Protective Frisk,Reasonable Suspicion                         19
Incident to Arrest,Inventory,Protective Frisk                 18
Incident to Arrest,Probable Cause,Protective Frisk            13
Inventory,Protective Fris

which you can see at the top of the `value_counts()` output: Incident to Arrest, Probable Cause, Inventory, Reasonable Suspicion, and Protective Frisk. But sometimes, multiple values are relevant for a single traffic stop, in which case they're **separated by commas**. 

Let's focus on Inventory, meaning searches in which the police took an inventory of the vehicle. Looking at the third line of the `value_counts()` output, we see 219, which is the number of searches in which Inventory was the only search type. 

But what if we wanted to know the total number of times in which an inventory was done during a search? We'd also have to include any stops in which Inventory was one of multiple search types. To do this, we'll use a string method.

### Searching for a string (1)
We'll use a string method called `contains()` that checks whether a string is present in each element of a given column. It returns `True` if the string is found, and `False` if it's not found. We also specify `na=False`, which tells the `contains()` method to return `False` when it finds a missing value in the `search_type` column. We'll save the results in a new column called `inventory`.

In [57]:
ri['inventory'] = ri.search_type.str.contains('Inventory', na=False)

### Searching for a string (2)
As expected, the data type of the column is Boolean. 

In [59]:
ri.inventory.dtype

dtype('bool')

To be clear, a `True` value in this column means that an inventory was done during a search, and a `False` value means it was not. We can take the `sum()` of the inventory column to see that an inventory was done during 441 searches.

In [60]:
ri.inventory.sum()

441

This includes the 219 stops in which Inventory was the only search type, plus additional stops in which Inventory was one of multiple search types.
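The difference between an exact match and a substring match can be seen on a small made-up Series. This is a minimal sketch; the values are invented and only mimic the comma-separated format of `search_type`.

```python
import pandas as pd

# Made-up search types; None marks stops without a search
search_type = pd.Series([
    'Inventory',
    'Probable Cause',
    None,
    'Incident to Arrest,Inventory',
    None,
])

only_inventory = (search_type == 'Inventory').sum()          # exact match only
inventory = search_type.str.contains('Inventory', na=False)  # substring match, missing -> False
any_inventory = inventory.sum()

print(only_inventory, any_inventory)  # 1 2
```

Note that `str.contains()` treats its first argument as a regular expression by default. That makes no difference for a plain word like `'Inventory'`, and passing `regex=False` is an option when the search text contains special characters.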

### Calculating the inventory rate
What if we wanted to calculate the percentage of searches which included an inventory? You might think this would be as simple as taking the `mean()` of the inventory column,

In [61]:
ri.inventory.mean()

0.0050961449570121106

and the answer would be about 0.5%. But what's wrong with this calculation? 

0.5% is the percentage of all traffic stops which resulted in an inventory, including those stops in which a search was not even done. 

Instead, we first need to filter the DataFrame to only include those rows in which a search was done, and then take the `mean()` of the inventory column.

In [62]:
searched = ri[ri.search_conducted == True]
searched.inventory.mean()

0.13335349259147264

The correct answer is that 13.3% of searches included an inventory. 

This is a vastly different result, and it highlights the importance of carefully choosing which rows are relevant before doing a calculation.
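The effect of the denominator is easy to reproduce. In this minimal sketch on ten made-up stops, four involve a search and one of those includes an inventory, so the rate is 10% of all stops but 25% of searches.

```python
import pandas as pd

# Ten made-up stops for illustration only
df = pd.DataFrame({
    'search_conducted': [True, True, False, False, False, False, False, False, True, True],
    'inventory':        [True, False, False, False, False, False, False, False, False, False],
})

rate_all_stops = df.inventory.mean()       # denominator: all 10 stops

searched = df[df.search_conducted == True]
rate_searches = searched.inventory.mean()  # denominator: the 4 searches

print(rate_all_stops, rate_searches)  # 0.1 0.25
```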

## Counting protective frisks
During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a "protective frisk."

In this exercise, you'll first check to see how many times "Protective Frisk" was the only search type. Then, you'll use a string method to locate all instances in which the driver was frisked.

- Count the `search_type` values in the `ri` DataFrame to see how many times "Protective Frisk" was the only search type.

In [64]:
# Count the 'search_type' values
print(ri.search_type.value_counts())

Incident to Arrest                                          1290
Probable Cause                                               924
Inventory                                                    219
Reasonable Suspicion                                         214
Protective Frisk                                             164
Incident to Arrest,Inventory                                 123
Incident to Arrest,Probable Cause                            100
Probable Cause,Reasonable Suspicion                           54
Probable Cause,Protective Frisk                               35
Incident to Arrest,Inventory,Probable Cause                   35
Incident to Arrest,Protective Frisk                           33
Inventory,Probable Cause                                      25
Protective Frisk,Reasonable Suspicion                         19
Incident to Arrest,Inventory,Protective Frisk                 18
Incident to Arrest,Probable Cause,Protective Frisk            13
Inventory,Protective Fris

- Create a new column, `frisk`, that is `True` if `search_type` contains the string `"Protective Frisk"` and `False` otherwise.

In [68]:
# Check if 'search_type' contains the string 'Protective Frisk'
ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False)

- Check the data type of `frisk` to confirm that it's a Boolean Series.

In [69]:
# Check the data type of 'frisk'
print(ri.frisk.dtype)

bool


- Take the sum of `frisk` to count the total number of frisks.

In [70]:
# Take the sum of 'frisk'
print(ri.frisk.sum())

303


*It looks like there were 303 drivers who were frisked.*

*Next, you'll examine whether gender affects who is frisked.*

## Comparing frisk rates by gender
In this exercise, you'll compare the rates at which female and male drivers are frisked during a search. Are males frisked more often than females, perhaps because police officers consider them to be higher risk?

Before doing any calculations, it's important to filter the DataFrame to only include the relevant subset of data, namely stops in which a search was conducted.

- Create a DataFrame, `searched`, that only contains rows in which `search_conducted` is `True`.

In [71]:
# Create a DataFrame of stops in which a search was conducted
searched = ri[ri.search_conducted == True]

- Take the mean of the `frisk` column to find out what percentage of searches included a frisk.

In [72]:
# Calculate the overall frisk rate by taking the mean of 'frisk'
print(searched.frisk.mean())

0.09162382824312065


- Calculate the frisk rate for each gender using a `.groupby()`.

In [74]:
# Calculate the frisk rate for each gender
print(searched.groupby('driver_gender').frisk.mean())

driver_gender
F    0.074561
M    0.094353
Name: frisk, dtype: float64


*The frisk rate is higher for males than for females, though we can't conclude that this difference is caused by the driver's gender.*
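When comparing rates between groups, it can be useful to see how many rows each rate is based on. One way to do that, shown here as an optional sketch on made-up data, is to pass a list of aggregations to `agg()` after the `groupby()`.

```python
import pandas as pd

# Made-up stops for illustration only
df = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M'],
    'search_conducted': [True, True, False, False, True, True, True, True, False, False],
    'frisk': [False, True, False, False, True, False, True, False, False, False],
})

# Keep only the stops in which a search was conducted
searched = df[df.search_conducted == True]

# mean = frisk rate, sum = number of frisks, count = number of searches
summary = searched.groupby('driver_gender').frisk.agg(['mean', 'sum', 'count'])
print(summary)
```

In this toy data both groups have a frisk rate of 0.5, but the rate for females rests on only 2 searches versus 4 for males, which is the kind of context a rate alone does not show.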