# Basic Statistical Analysis

The most fundamental analysis we can do looks at stop rates per capita. How many times per year is a person stopped by the police, on average? How does it vary by race?

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

from astral import Observer
from astral.sun import sun
from pytz import timezone
from timezonefinder import TimezoneFinder
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [2]:
sf = pd.read_csv('../data/raw_data/san_francisco.csv', low_memory=False)

In [3]:
for col in sf.columns:
    print(col)

raw_row_number
date
time
location
lat
lng
district
subject_age
subject_race
subject_sex
type
arrest_made
citation_issued
outcome
contraband_found
search_conducted
search_vehicle
search_basis
reason_for_stop
raw_search_vehicle_description
raw_result_of_contact_description


In [5]:
sf.shape

(905070, 22)

Let's count the 900,000+ stops by the race of the person stopped.

In [5]:
sf["subject_race"].value_counts()

white                     372318
asian/pacific islander    157684
black                     152196
hispanic                  116014
other                     106858
Name: subject_race, dtype: int64

In [6]:
stop_by_race = sf["subject_race"].value_counts()

In [7]:
stop_by_race_frac = stop_by_race / sf.shape[0]
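
As an aside, pandas can produce these shares in one step with `value_counts(normalize=True)`. The two approaches differ only when some rows have a missing race: `value_counts` drops `NaN` by default, while `sf.shape[0]` counts every row. In this dataset the five counts sum to 905,070, the number of rows, so both give the same result. The sketch below shows the difference on a small made-up Series (`toy_race` is hypothetical, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for sf["subject_race"]; not the real dataset.
toy_race = pd.Series(["white", "black", "white", "hispanic", np.nan, "white"])

# Approach used in the notebook: counts divided by the total number of rows.
frac_by_rows = toy_race.value_counts() / toy_race.shape[0]

# One-step alternative: shares among rows with a non-missing value.
frac_normalized = toy_race.value_counts(normalize=True)

print(frac_by_rows["white"])     # 3 of 6 rows
print(frac_normalized["white"])  # 3 of 5 non-missing rows
```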

In [8]:
# Population shares in the order: white, asian/pacific islander, black, hispanic, other
# Source: https://www.census.gov/quickfacts/sanfranciscocountycalifornia
sf_pop_list = [.402, .36+.005, .056, .152, .007+.045]

In [9]:
sf_pop = pd.Series(sf_pop_list, index = ['white', 'asian/pacific islander', 'black', 'hispanic', 'other'])

In [10]:
sf_pop

white                     0.402
asian/pacific islander    0.365
black                     0.056
hispanic                  0.152
other                     0.052
dtype: float64

In [11]:
stop_by_race_frac / sf_pop

white                     1.023307
asian/pacific islander    0.477323
black                     3.002846
hispanic                  0.843305
other                     2.270500
dtype: float64

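Each value is a group's share of stops divided by its share of the population, so 1 means the group is stopped in proportion to its population. Black people are stopped at about 3 times their population share, white people at about their population share (1.02), Hispanic people at 0.84 times, and Asian/Pacific Islander people at 0.48 times. Because the total number of stops (905,070) happens to be close to the city's population, these ratios also approximate stops per resident over the period the data covers; they are stops per year only if the data spans a single year.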
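
The ratios above can be reproduced from the published counts and shares alone. The following is a minimal sketch, not the notebook's own code: `representation_ratio` is a hypothetical helper name, and the numbers are copied from the cell outputs above. It also expresses each ratio relative to the white ratio, a comparison that does not depend on how many years the data covers.

```python
import pandas as pd

# Stop counts and population shares copied from the outputs above.
stops = pd.Series({
    "white": 372318,
    "asian/pacific islander": 157684,
    "black": 152196,
    "hispanic": 116014,
    "other": 106858,
})
pop_share = pd.Series({
    "white": 0.402,
    "asian/pacific islander": 0.365,
    "black": 0.056,
    "hispanic": 0.152,
    "other": 0.052,
})

def representation_ratio(counts, shares):
    """Share of stops divided by share of population (hypothetical helper)."""
    stop_share = counts / counts.sum()
    # pandas aligns the two Series by label, so entry order need not match.
    return stop_share / shares

ratio = representation_ratio(stops, pop_share)
relative_to_white = ratio / ratio["white"]

print(ratio.round(3))
print(relative_to_white.round(2))
```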

### Complications to this quick analysis

There are many ways this quick analysis could miss something. The first is that it uses the resident population of the city. What if many people pass through the city, perhaps working there but not living there? If the demographics of the commuters differed from those of the residents, and the police were equally likely to pull anyone over, we would see differences in stop rates for completely innocent reasons.

How many Black commuters would there need to be in order to make the rates match?

In [12]:
pop_by_race = 881549 * sf_pop  # total city population times each group's share

In [14]:
white_stops_per_person = stop_by_race["white"] / pop_by_race["white"]

In [15]:
white_stops_per_person

1.0506099820934258

In [19]:
# Black population at which Black stops per person would equal the white rate:
# stops / population = rate, so population = stops / rate
black_population_required = stop_by_race["black"] / white_stops_per_person

In [20]:
round(black_population_required - pop_by_race["black"], 2)

95497.67

So, if about 95,000 Black people were the only commuters into San Francisco, that would be enough to account for the discrepancy. Of course, there are nowhere near 95,000 Black commuters to SF. And the real number would need to be even larger, because there are also commuters of other races. We could work out the detailed numbers, but there's no need: the explanation that the stop discrepancy is due to non-residents cannot possibly be true.
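
The parity condition behind this estimate is stops / (residents + commuters) = target rate, which rearranges to commuters = stops / target rate - residents. A minimal sketch of that calculation follows, assuming the same population total and shares used above; `extra_population_for_parity` is a hypothetical helper name. It checks itself by recomputing the Black stop rate with the extra people included.

```python
import pandas as pd

TOTAL_POPULATION = 881549  # total city population used in the notebook
stops = pd.Series({"white": 372318, "black": 152196})
residents = TOTAL_POPULATION * pd.Series({"white": 0.402, "black": 0.056})

def extra_population_for_parity(group_stops, group_residents, target_rate):
    """Non-residents needed so that stops / (residents + extra) == target_rate."""
    return group_stops / target_rate - group_residents

white_rate = stops["white"] / residents["white"]
extra_black = extra_population_for_parity(
    stops["black"], residents["black"], white_rate
)

# Sanity check: with the extra people counted, the two rates should match.
implied_rate = stops["black"] / (residents["black"] + extra_black)

print(f"white stops per resident: {white_rate:.4f}")
print(f"extra Black non-residents needed: {extra_black:,.0f}")
print(f"Black stops per person with them included: {implied_rate:.4f}")
```

Dividing by the target rate, rather than multiplying, is what keeps the units consistent: stops divided by stops-per-person yields a number of people.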

Of course, there are many other possible explanations. Perhaps people of different races commit traffic violations at different rates. To address these complications, we'll need more sophisticated methods. But for San Francisco in 2019, the apparent discrepancy is large. It's clear that any innocent explanation is going to have to do a lot of work to prove itself. 