In [2]:
import pandas as pd

In [5]:
sf = pd.read_csv('../data/raw_data/san_francisco.csv', low_memory=False)

In [6]:
la = pd.read_csv('../data/raw_data/los_angeles.csv', low_memory=False)

### To-do:
- Check that columns with same semantic meaning for different cities have the same column name. If not, change column names. [DONE]
- Only retain the intersection of columns shared by all cities. (Or drop cities where there's insufficient data to perform relevant tests.) [Likely dropping LA]
- Do sanity checks to identify nonsensical rows
- Plot histograms of variables
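
A minimal sketch of the column-intersection step from the to-do list, generalized to any number of cities. `shared_columns` is a hypothetical helper, not part of this notebook; it only needs iterables of column names, so `sf.columns` and `la.columns` could be passed directly. The toy lists below are stand-ins, not the real column sets.

```python
def shared_columns(*column_sets):
    """Return the sorted column names present in every input (hypothetical helper)."""
    if not column_sets:
        return []
    common = set(column_sets[0])
    for cols in column_sets[1:]:
        common &= set(cols)
    return sorted(common)

# Toy column lists standing in for sf.columns and la.columns.
sf_cols = ["date", "time", "district", "subject_race", "arrest_made"]
la_cols = ["date", "time", "district", "subject_race", "officer_id_hash"]

shared = shared_columns(sf_cols, la_cols)
print(shared)
# With the real frames, the same list could be used to subset each one,
# e.g. sf[shared] and la[shared].
```

Sorting the result keeps the column order identical across cities, which should make later concatenation less error-prone.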

In [8]:
sf.columns

Index(['raw_row_number', 'date', 'time', 'location', 'lat', 'lng', 'district',
       'subject_age', 'subject_race', 'subject_sex', 'type', 'arrest_made',
       'search_conducted', 'search_vehicle', 'search_basis', 'reason_for_stop',
       'raw_search_vehicle_description', 'raw_result_of_contact_description'],
      dtype='object')

In [9]:
la.columns

Index(['raw_row_number', 'date', 'time', 'district', 'region', 'subject_race',
       'subject_sex', 'officer_id_hash', 'type', 'raw_descent_description'],
      dtype='object')

In [15]:
# Columns shared by LA & SF
set(sf.columns).intersection(set(la.columns))

{'date',
 'district',
 'raw_row_number',
 'subject_race',
 'subject_sex',
 'time',
 'type'}

In [10]:
# Columns only in LA
set(la.columns) - set(sf.columns)

{'officer_id_hash', 'raw_descent_description', 'region'}

In [14]:
# Columns only in SF
set(sf.columns) - set(la.columns)

{'arrest_made',
 'citation_issued',
 'contraband_found',
 'lat',
 'lng',
 'location',
 'outcome',
 'raw_result_of_contact_description',
 'raw_search_vehicle_description',
 'reason_for_stop',
 'search_basis',
 'search_conducted',
 'search_vehicle',
 'subject_age',

**SF data spans 2007-01-01 to 2016-06-30. LA data spans 2010-01-01 to 2018-06-23.**

In [22]:
pd.to_datetime(sf.date).min()

Timestamp('2007-01-01 00:00:00')

In [23]:
pd.to_datetime(sf.date).max()

Timestamp('2016-06-30 00:00:00')

In [25]:
pd.to_datetime(la.date).min()

Timestamp('2010-01-01 00:00:00')

In [26]:
pd.to_datetime(la.date).max()

Timestamp('2018-06-23 00:00:00')
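
The two date ranges only partly overlap, so a cross-city comparison would presumably need to be limited to the common window. A small sketch of computing that window from the min/max dates printed above; `overlap_window` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
from datetime import date

def overlap_window(*ranges):
    """Return the (start, end) covered by every range, or None if they are disjoint."""
    start = max(r[0] for r in ranges)
    end = min(r[1] for r in ranges)
    return (start, end) if start <= end else None

# Date ranges taken from the min/max outputs above.
sf_range = (date(2007, 1, 1), date(2016, 6, 30))
la_range = (date(2010, 1, 1), date(2018, 6, 23))

window = overlap_window(sf_range, la_range)
print(window)
```

For these inputs the common window runs from 2010-01-01 through 2016-06-30.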

**SF has only vehicular stops, whereas LA has both vehicular and pedestrian stops.**

In [28]:
la.type.value_counts()

vehicular     4135353
pedestrian    1283048
Name: type, dtype: int64

In [29]:
sf.type.value_counts()

vehicular    905070
Name: type, dtype: int64

Note to self: LA seems of little use, since we don't know the result of each stop, whether a search occurred, etc. I think there's enough to run a veil-of-darkness test (which needs only stop times and subject race), but not a threshold test (which needs search and outcome data).
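
A rough sketch of the idea behind a veil-of-darkness comparison, to show why stop times and race alone could be enough. It restricts to a clock-time window that is light on some dates and dark on others, then compares one group's share of stops in darkness versus daylight. Everything here is an assumption for illustration: the toy records, the `is_dark` flag (which in practice would have to be derived from sunset/dusk times per date), the window bounds, and the helper name `share_by_lighting`. A real test would also model clock time rather than compare raw shares.

```python
from datetime import time

# Hypothetical toy records: (clock time of stop, was it dark?, subject_race).
stops = [
    (time(18, 10), False, "black"),
    (time(18, 20), False, "white"),
    (time(18, 30), False, "white"),
    (time(18, 40), False, "white"),
    (time(18, 15), True, "black"),
    (time(18, 25), True, "white"),
    (time(18, 35), True, "white"),
    (time(18, 45), True, "white"),
    (time(12, 0), False, "black"),  # outside the window, so ignored
]

def share_by_lighting(stops, group, window_start, window_end):
    """Share of stops involving `group`, split by darkness, within a clock-time window."""
    counts = {True: [0, 0], False: [0, 0]}  # is_dark -> [group stops, all stops]
    for clock, is_dark, race in stops:
        if window_start <= clock <= window_end:
            counts[is_dark][1] += 1
            counts[is_dark][0] += int(race == group)
    return {
        ("dark" if is_dark else "light"): (g / n if n else None)
        for is_dark, (g, n) in counts.items()
    }

shares = share_by_lighting(stops, "black", time(17, 0), time(19, 30))
print(shares)
```

A noticeably lower share for the group after dark than in daylight would be the pattern the test looks for; equal shares, as in this toy data, would not suggest a difference.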

One interesting thing about the **LA** dataset: it uses Brian's idea of an officer ID hash!

In [33]:
la.officer_id_hash.value_counts()[:10]

26dad3a37a    21118
f4e01343d9    18064
845ba9b6e4    17851
a39a690ad7    15338
917080a91d    15321
eccd01137d    14070
4fc7b40217    13448
241da4afe3    13271
aa69ad87ba    13003
355e33393e    12975
Name: officer_id_hash, dtype: int64
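
For reference, one way an officer ID hash like this could be produced is a salted hash truncated to a short hex string. This is only a sketch under assumptions: the data does not say how the LA hashes were actually generated, and the function name, salt, and use of SHA-256 are all illustrative. The 10-character length is chosen only to resemble the values above.

```python
import hashlib

def officer_id_hash(officer_id, salt, length=10):
    """Map a raw officer ID to a short, stable pseudonym (illustrative scheme only)."""
    digest = hashlib.sha256((salt + str(officer_id)).encode("utf-8")).hexdigest()
    return digest[:length]

SALT = "hypothetical-secret-salt"  # assumption: a secret kept out of the published data

a = officer_id_hash(12345, SALT)
b = officer_id_hash(12345, SALT)
c = officer_id_hash(67890, SALT)
print(a, c)
```

The same ID always maps to the same pseudonym, so per-officer counts like the ones above remain possible without publishing the raw ID.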