# OPEN POLICING DATASET EDA FOLLOWING KEVIN MARKHAM'S PYCON PRESENTATION

**This notebook includes both my own answers and the official answers from Kevin Markham's PyCon presentation, along with my own additions related to the dataset. It is intended to be purely instructive.**

    GitHub repository: https://github.com/justmarkham/pycon-2018-tutorial
    Instructor: Kevin Markham
    GitHub: https://github.com/justmarkham
    Twitter: https://twitter.com/justmarkham
    YouTube: https://www.youtube.com/dataschool
    Website: http://www.dataschool.io

# ABOUT DATA
### Context
On a typical day in the United States, police officers make more than 50,000 traffic stops. Our team is gathering, analyzing, and releasing records from millions of traffic stops by law enforcement agencies across the country. Our goal is to help researchers, journalists, and policymakers investigate and improve interactions between police and the public.

### Content
This dataset includes 9 MB of stop data from Rhode Island, covering all of 2013 onwards. Please see the data readme for the full details of the available fields.

### Acknowledgements
This dataset was kindly made available by the Stanford Open Policing Project. If you use it for a research publication, please cite their working paper: E. Pierson, C. Simoiu, J. Overgoor, S. Corbett-Davies, V. Ramachandran, C. Phillips, S. Goel. (2017) “A large-scale analysis of racial disparities in police stops across the United States”.

### Inspiration
* Do men or women speed more often?
* Does gender affect who gets searched during a stop?
* During a search, how often is the driver frisked?
* Which year had the least number of stops?
* How does drug activity change by time of day?
* Do most stops occur at night?

Importing necessary libraries:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use("fivethirtyeight")

### Pandas

It offers powerful, expressive and flexible data structures that make data manipulation and analysis easy.

> pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

### Matplotlib

> Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

### Seaborn

> Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Reading data:

***read_csv*** : Read a comma-separated values (csv) file into DataFrame. Also supports optionally iterating or breaking of the file into chunks.

In [None]:
data = pd.read_csv("/kaggle/input/stanford-open-policing-project/police_project.csv")

What does each row represent?

***head*** : Return the first n rows. (By default return first 5 rows.)

In [None]:
data.head(5)

***shape*** : Return a tuple representing the dimensionality of the DataFrame.

In [None]:
# What do these numbers mean?

data.shape

***dtypes*** : This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns. Columns with mixed types are stored with the object dtype.

In [None]:
# What do these types mean?

data.dtypes

* What does NaN mean?
#### In computing, NaN, standing for Not a Number, is a member of a numeric data type that can be interpreted as a value that is undefined or unrepresentable, especially in floating-point arithmetic. (Wikipedia)

* Why might a value be missing?
#### In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. (Wikipedia)

* Why mark it as NaN? Why not mark it as a 0 or an empty string or a string saying "Unknown"?
#### We need to be able to distinguish missing values from real data. If a missing value were marked with the string 'Unknown' and the column also contained 'Unknown' as a genuine value, how would we tell them apart? That is why missing values are marked as NaN.
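
As a small illustration of the point above, the sketch below uses made-up toy Series (not the policing data) to show that `isna` flags only truly missing entries, and that a `0` placeholder would distort a mean whereas NaN is skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: "Unknown" is a genuine recorded value here.
reported = pd.Series(["Speeding", "Unknown", np.nan, "Equipment"])

# isna() flags only the truly missing entry, not the "Unknown" string.
n_missing = int(reported.isna().sum())

# Numeric reductions skip NaN, whereas a 0 placeholder would skew them.
ages_nan = pd.Series([20.0, 30.0, np.nan])
ages_zero = pd.Series([20.0, 30.0, 0.0])
mean_with_nan = ages_nan.mean()    # averages 20 and 30 only
mean_with_zero = ages_zero.mean()  # the 0 drags the average down

print(n_missing, mean_with_nan, mean_with_zero)
```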

***isna*** : Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, gets mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

***sum*** : Return the sum of the values for the requested axis.

In [None]:
# What are these counts?

data.isna().sum()

**Data contains 91741 rows, and county_name has 91741 null values.**

# Task 1: Remove the column that only contains missing values

***drop*** : Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. *axis=1 means that ***columns*** are dropped.* Setting inplace=True modifies the DataFrame in place instead of returning a new one.

In [None]:
data.drop('county_name', axis = 1, inplace = True)

In [None]:
# checking the new shape

data.shape

In [None]:
# Checking columns

data.columns

In [None]:
# Alternative way to drop column that only contains missing values:

# data.dropna(axis = 'columns', how = 'all', inplace = True)

### LESSONS : 
* Pay attention to default arguments.
* Check your work.
* There is more than one way to do everything in Pandas.
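
The first lesson can be seen on a toy DataFrame (hypothetical column names, not the real data): `dropna` returns a new object by default, so without `inplace=True` or reassignment the original keeps its all-missing column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one column that contains only missing values.
toy = pd.DataFrame({"a": [1, 2], "empty": [np.nan, np.nan]})

# Default inplace=False: the result is a new DataFrame, `toy` is untouched.
dropped = toy.dropna(axis="columns", how="all")

print(list(toy.columns))      # original still has both columns
print(list(dropped.columns))  # the all-NaN column is gone
```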

# TASK 2 : Do men or women speed more often?

***The code below restricts the data to rows whose violation is 'Speeding' and counts them by driver gender.***

In [None]:
data[data.violation == 'Speeding'].driver_gender.value_counts()

In [None]:
# Alternative way to do the same:

data.loc[data.violation == 'Speeding', 'driver_gender'].value_counts()

* Men are stopped for speeding more often than women.

***groupby*** : A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

In [None]:
data.groupby('driver_gender').violation.count()

* Men are more often stopped by the police than women.

In [None]:
# When a man is pulled over, how often is it for speeding?

data[data.driver_gender == 'M'].violation.value_counts()

In [None]:
# Repeat for women

data[data.driver_gender == 'F'].violation.value_counts()

In [None]:
# Combines the two lines above

data.groupby('driver_gender').violation.value_counts()

### LESSON :
* There is more than one way to understand a question.
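
The two readings of the question can be contrasted on a toy frame. This is only a sketch with invented rows. It uses `value_counts(normalize=True)` inside the groupby to get within-gender shares, which is one possible way to compare groups of different sizes.

```python
import pandas as pd

# Hypothetical toy stops: 4 men and 2 women.
stops = pd.DataFrame({
    "driver_gender": ["M", "M", "M", "M", "F", "F"],
    "violation": ["Speeding", "Speeding", "Equipment", "Other",
                  "Speeding", "Speeding"],
})

# Reading 1: among speeding stops, how many involve each gender?
speeding_by_gender = stops.loc[
    stops.violation == "Speeding", "driver_gender"
].value_counts()

# Reading 2: within each gender, what share of stops is for speeding?
share_by_gender = stops.groupby("driver_gender").violation.value_counts(
    normalize=True
)

print(speeding_by_gender)
print(share_by_gender)
```

In this toy data the raw counts are tied, while the within-gender shares differ, so the two readings can lead to different answers.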

# TASK 3 : Does gender affect who gets searched during a stop?

In [None]:
# My answer:

data.groupby('driver_gender').search_conducted.value_counts()

In [None]:
# Ignore gender for the moment

data.search_conducted.value_counts(normalize = True)

In [None]:
# How does this work?

data.search_conducted.mean()
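
A brief sketch of why the mean works here, using a made-up boolean Series: `True` counts as 1 and `False` as 0, so the mean of a boolean column is the proportion of `True` values.

```python
import pandas as pd

# Hypothetical toy column: one search in four stops.
searched = pd.Series([True, False, False, False])

rate_from_mean = searched.mean()                    # mean of 1, 0, 0, 0
rate_from_counts = searched.sum() / len(searched)   # True count / row count

print(rate_from_mean, rate_from_counts)
```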

In [None]:
data.groupby('driver_gender').search_conducted.mean()

In [None]:
# Include a second factor

data.groupby(['violation', 'driver_gender']).search_conducted.mean()

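* Does this prove causation?
#### No. These rates show a relationship between gender and search rate, but other factors may explain it, so we cannot conclude causation from them.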

### LESSONS:
* Causation is difficult to conclude, so focus on relationship.
* Include all relevant factors when studying a relationship.

# TASK 4: Why is search_type missing often?

In [None]:
data.isna().sum()

In [None]:
# Maybe search_type is missing when search_conducted is False?

data.search_conducted.value_counts()

In [None]:
# Test that theory, why is the series empty?

data[data.search_conducted == False].search_type.value_counts()

In [None]:
# Value_counts ignores missing values by default

data[data.search_conducted == False].search_type.value_counts(dropna = False)

In [None]:
# When search_conducted is True search_type is never missing.

data[data.search_conducted == True].search_type.value_counts()

***isnull is an alias of isna.***

In [None]:
# Alternative

data[data.search_conducted == True].search_type.isnull().sum()

### LESSONS:
* Verify your assumptions about your data.
* Pandas functions ignore missing values by default.
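
The second lesson can be checked on a toy Series (invented values): `value_counts` leaves NaN out unless `dropna=False` is passed, which is why the earlier result looked empty.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: two recorded search types and three missing ones.
search_type = pd.Series(["Frisk", np.nan, np.nan, "Frisk", np.nan])

counts_default = search_type.value_counts()               # NaN is dropped
counts_with_nan = search_type.value_counts(dropna=False)  # NaN is counted

print(counts_default)
print(counts_with_nan)
```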

# TASK 5: During a search, how often is the driver frisked?

In [None]:
# Multiple types are separated by commas

data.search_type.value_counts(dropna = False)

In [None]:
data['frisk'] = data.search_type == 'Protective Frisk'

In [None]:
data.head()

In [None]:
data.frisk.dtype

In [None]:
# Includes exact matches only

data.frisk.sum()

***mean*** : Return the mean of the values for the requested axis.

In [None]:
# Is this the answer?

data.frisk.mean()

In [None]:
# Uses the wrong denominator (includes stops that didn't involve a search)

data.frisk.value_counts(dropna = False)

In [None]:
161 / (91580 + 161)

***str*** : Vectorized string functions for Series and Index.

***split*** : Split strings around given separator/delimiter.

***contains*** : Test if pattern or regex is contained within a string of a Series or Index.

In [None]:
# Includes partial matches

data['frisk'] = data.search_type.str.contains('Protective Frisk')

In [None]:
data.head()

In [None]:
# Seems about right

data.frisk.sum()

In [None]:
data.frisk.mean()

In [None]:
# str.contains preserved missing values from search_type

data.frisk.value_counts(dropna = False)

In [None]:
# excludes stops that didn't involve a search
274 / (2922 + 274)

### LESSONS:
* Use string methods to find partial matches.
* Use the correct denominator when calculating rates.
* Pandas calculations ignore missing values.
* Apply the "smell test" to your results.
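
The denominator issue can be reproduced on a toy Series. This is a sketch under one stated assumption: the column is forced to `dtype=object`, because with object-dtype strings `str.contains` keeps NaN for missing entries, and other string dtypes may fill them differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: three searches and three stops without a search.
search_type = pd.Series(
    [
        "Protective Frisk",
        "Incident to Arrest,Protective Frisk",
        "Probable Cause",
        np.nan,
        np.nan,
        np.nan,
    ],
    dtype=object,  # assumption: object dtype, so NaN is preserved below
)

# Exact comparison: NaN compares as False and combined types are missed.
exact = search_type == "Protective Frisk"

# Partial match: NaN stays NaN, so non-searches can be excluded afterwards.
partial = search_type.str.contains("Protective Frisk")

n_frisk = int((partial == True).sum())  # frisks among all rows
n_searches = int(partial.notna().sum())  # rows where a search happened
frisk_rate = n_frisk / n_searches        # correct denominator: searches only

print(int(exact.sum()), n_frisk, n_searches, frisk_rate)
```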

# TASK 6: Which year had the least number of stops?

In [None]:
data.head()

### My Answer :

In [None]:
data['stop_date'].isna().sum()

Let's create a new column named 'stop_year' by using the str.split function to split each stop_date at '-' and keeping the first piece.

In [None]:
data['stop_year'] = data['stop_date'].str.split('-', expand = True)[0]

In [None]:
data['stop_year'].unique()

In [None]:
data.stop_year.value_counts()

### Year 2005 had the fewest stops!

### Answer from Pycon:

***slice(start, stop)*** : Slice substrings from each element in the Series or Index.

In [None]:
# This works, but there is a better way

data.stop_date.str.slice(0, 4).value_counts()

***cat*** : Concatenate strings in the Series/Index with given separator.

***to_datetime*** : Convert argument to datetime.

In [None]:
# make sure you create this column

# Concatenate two string columns with the 'cat' method: stop_date and stop_time
combined = data.stop_date.str.cat(data.stop_time, sep = ' ')

# Convert the combined strings to the datetime type
data['stop_datetime'] = pd.to_datetime(combined)

In [None]:
data.dtypes

In [None]:
data.stop_datetime.dt.year.value_counts()

### LESSONS:
* Consider removing chunks of data that may be biased.
* Use the datetime data type for dates and times.
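
A compact sketch of the datetime workflow on a toy frame (invented dates and times): once the strings are converted, the `.dt` accessor exposes components such as `year` and `hour` without any string slicing.

```python
import pandas as pd

# Hypothetical toy rows in the same string layout as stop_date / stop_time.
toy = pd.DataFrame({
    "stop_date": ["2005-01-02", "2006-03-04", "2006-07-08"],
    "stop_time": ["01:55", "23:10", "08:30"],
})

# Join the two string columns, then parse the result as datetimes.
combined = toy.stop_date.str.cat(toy.stop_time, sep=" ")
toy["stop_datetime"] = pd.to_datetime(combined)

years = toy.stop_datetime.dt.year
hours = toy.stop_datetime.dt.hour
stops_per_year = years.value_counts().sort_index()

print(stops_per_year)
```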

# TASK 7: How does drug activity change by time of day?

The entries we want are in the 'drugs_related_stop' column.

In [None]:
data.drugs_related_stop.dtypes

In [None]:
# Baseline rate

data.drugs_related_stop.mean()

In [None]:
# Cannot group by the name 'hour' unless it exists as a column, so pass the hour Series itself

data.groupby(data.stop_datetime.dt.hour).drugs_related_stop.mean()

In [None]:
# Line plot by default (by series)

data.groupby(data.stop_datetime.dt.hour).drugs_related_stop.mean().plot()

In [None]:
# Alternative: count drug-related stops by hour

data.groupby(data.stop_datetime.dt.hour).drugs_related_stop.sum().plot()

### LESSONS:
* Be conscious of sorting when plotting.
* Use plots to help you understand trends.
* Create exploratory plots using Pandas one-liners.

# TASK 8: Do most stops occur at night?

In [None]:
data.stop_datetime.dt.hour.value_counts()

In [None]:
data.stop_datetime.dt.hour.value_counts().plot();

In [None]:
data.stop_datetime.dt.hour.value_counts().sort_index().plot();

In [None]:
# Alternative method

data.groupby(data.stop_datetime.dt.hour).stop_date.count().plot();

### LESSONS:
* Be conscious of sorting when plotting.
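
The sorting issue can be seen without plotting. In this toy sketch (invented hours), `value_counts` orders the result by frequency, so a line plot of it would connect hours in a scrambled order, and `sort_index` restores the chronological order.

```python
import pandas as pd

# Hypothetical toy hours of day for six stops.
hours = pd.Series([23, 23, 23, 8, 8, 1])

by_frequency = hours.value_counts()         # most frequent hour first
by_hour = hours.value_counts().sort_index() # hours in ascending order

print(list(by_frequency.index))
print(list(by_hour.index))
```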

# TASK 9: Find the bad data in the stop_duration column and fix it

In [None]:
# Mark bad data as missing

data.stop_duration.value_counts(dropna = False)

In [None]:
# What four things are wrong with this code?

# data[(data.stop_duration == 1) | (data.stop_duration == 2)].stop_duration = 'NaN'

In [None]:
# What two things are still wrong with this code?

data[(data.stop_duration == '1') | (data.stop_duration == '2')].stop_duration = 'NaN'

In [None]:
# Assignment statement did not work

data.stop_duration.value_counts()

In [None]:
data.loc[(data.stop_duration == '1') | (data.stop_duration == '2'), 'stop_duration'] = 'NaN'

In [None]:
# Confusing!

data.stop_duration.value_counts(dropna = False)

In [None]:
# Replace 'NaN' string with actual NaN value

import numpy as np

data.loc[data.stop_duration == 'NaN', 'stop_duration'] = np.nan

In [None]:
data.stop_duration.value_counts(dropna = False)

In [None]:
# Alternative method

# data.stop_duration.replace(['1', '2'], value = np.nan, inplace = True)

### LESSONS:
* Ambiguous data should be marked as missing.
* Don't ignore the SettingWithCopyWarning.
* NaN is not a string.
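
The failed assignment from this task can be reproduced on a toy frame. This is a minimal sketch with invented values. The warning is silenced only to keep the demo output clean, which is the opposite of what the lesson recommends in real work.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical toy column with two bad codes, '1' and '2'.
toy = pd.DataFrame({"stop_duration": ["0-15 Min", "1", "2", "30+ Min"]})
bad = toy.stop_duration.isin(["1", "2"])

# Chained indexing assigns into a temporary copy, so `toy` is left unchanged.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silenced for this demo only
    toy[bad].stop_duration = np.nan
missing_after_chained = int(toy.stop_duration.isna().sum())

# A single .loc call selects rows and column together and assigns in place.
toy.loc[bad, "stop_duration"] = np.nan
missing_after_loc = int(toy.stop_duration.isna().sum())

print(missing_after_chained, missing_after_loc)
```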

# TASK 10: What is the mean stop_duration for each violation_raw?

In [None]:
# Make sure you create this column

mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}

data['stop_minutes'] = data.stop_duration.map(mapping)

In [None]:
# Matches value_counts for stop_duration

data.stop_minutes.value_counts()

In [None]:
data.groupby('violation_raw').stop_minutes.mean()

In [None]:
data.groupby('violation_raw').stop_minutes.agg(['count', 'mean'])

### LESSONS:
* Convert strings to numbers for analysis.
* Approximate when necessary.
* Use count with mean to spot meaningless means.
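
A toy version of this task (invented rows, same mapping as above) shows why the count matters: a group mean computed from a single row is not very informative.

```python
import pandas as pd

# Hypothetical toy stops: three speeding stops and a single seatbelt stop.
toy = pd.DataFrame({
    "violation_raw": ["Speeding", "Speeding", "Speeding", "Seatbelt"],
    "stop_duration": ["0-15 Min", "16-30 Min", "0-15 Min", "30+ Min"],
})

# Approximate each duration range by a representative number of minutes.
mapping = {"0-15 Min": 8, "16-30 Min": 23, "30+ Min": 45}
toy["stop_minutes"] = toy.stop_duration.map(mapping)

# Report the group size next to the mean so tiny groups stand out.
summary = toy.groupby("violation_raw").stop_minutes.agg(["count", "mean"])

print(summary)
```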

# TASK 11: Plot the results of the first groupby from the previous exercise

In [None]:
# What's wrong with this?

data.groupby('violation_raw').stop_minutes.mean().plot()
plt.xticks(rotation = 90);

In [None]:
# How could this be made better?

data.groupby('violation_raw').stop_minutes.mean().plot(kind='bar');

In [None]:
data.groupby('violation_raw').stop_minutes.mean().sort_values().plot(kind='barh')

### LESSONS:
* Don't use line plot to compare categories.
* Be conscious of sorting and orientation when plotting.

# TASK 12: Compare the age distribution for each violation

In [None]:
data.groupby('violation').driver_age.describe()

In [None]:
# Histograms are excellent for displaying distributions

data.driver_age.plot(kind='hist');

In [None]:
# Similar to a histogram

data.driver_age.value_counts().sort_index().plot();

In [None]:
# Can't use the plot method

data.hist('driver_age', by = 'violation');

In [None]:
# What changed? how is this better or worse?

data.hist('driver_age', by='violation', sharex=True);

In [None]:
# What changed? how is this better or worse?

data.hist('driver_age', by='violation', sharex=True, sharey=True);

### Lessons:

* Use histograms to show distributions.
* Be conscious of axes when using grouped plots.

# TASK 13: Pretend you don't have the driver_age column, and create it from driver_age_raw (and call it new_age)

In [None]:
data.head()

In [None]:
# Appears to be year of stop_date minus driver_age_raw

data.tail()

In [None]:
data['new_age'] = data.stop_datetime.dt.year - data.driver_age_raw

In [None]:
# Compare the distributions

data[['driver_age', 'new_age']].hist();

In [None]:
# Compare the summary statistics (focus on min and max)

data[['driver_age', 'new_age']].describe()

In [None]:
# Calculate how many ages are outside that range

data[(data.new_age < 15) | (data.new_age > 99)].shape

In [None]:
# Raw data given to the researchers

data.driver_age_raw.isnull().sum()

In [None]:
# Age computed by the researchers (has more missing values)

data.driver_age.isnull().sum()

In [None]:
# What does this tell us? Researchers set driver_age as missing if less than 15 or more than 99?

5621-5327

In [None]:
# driver_age_raw not missing, driver_age missing

data[(data.driver_age_raw.notnull()) & (data.driver_age.isnull())].head()

In [None]:
# Set the ages outside that range as missing

data.loc[(data.new_age < 15) | (data.new_age > 99), 'new_age'] = np.nan

In [None]:
data.new_age.equals(data.driver_age)

### Lessons:

* Don't assume that the head and tail are representative of the data.
* Columns with missing values may still have bad data (driver_age_raw).
* Data cleaning sometimes involves guessing (driver_age).
* Use histograms for a sanity check.
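
The cleaning rule guessed in this task can be sketched on a toy frame. The column `stop_year` and all values are invented for illustration, and the 15 to 99 range is the same guess used above, not a documented rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: a plausible birth year, two bad ones, one missing.
toy = pd.DataFrame({
    "stop_year": [2010, 2010, 2010, 2010],
    "driver_age_raw": [1985.0, 0.0, 2008.0, np.nan],
})

# Age is approximated as the stop year minus the raw birth year.
toy["new_age"] = toy.stop_year - toy.driver_age_raw

# Ages outside the assumed plausible range are marked as missing.
out_of_range = (toy.new_age < 15) | (toy.new_age > 99)
toy.loc[out_of_range, "new_age"] = np.nan

print(toy)
```

The row that was already missing stays missing, and the two implausible ages (2010 and 2) join it, which mirrors why driver_age has more missing values than driver_age_raw.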