# Boulder jail booking charges

In this notebook we'll look specifically at the `Charge` column in the dataframe and try to parse it into its statute number and its description.

In [1]:
import pandas as pd

df = pd.read_csv('../data/all-bookings.csv')

In [2]:
df.head()

Unnamed: 0,Name,Booking No,Booked,Location,DOB,Race,Sex,Case No,Arresting Agency,Charge,Arrest Date
0,"ARELLANO-ORDAZ,SIMON",1106625,2011-08-09 22:20:00,BJ INW,1988-04-10,W,M,110010043,BOULDER PD,18-18-405(2)(A)(I). SALE/MFG/DIST/CONT S,2011-08-09
1,"ARELLANO-ORDAZ,SIMON",1106625,2011-08-09 22:20:00,BJ INW,1988-04-10,W,M,110010043,BOULDER PD,18-6-401(7)(B)(I) CHILD ABUSE,2011-08-09
2,"ARELLANO-ORDAZ,SIMON",1106625,2011-08-09 22:20:00,BJ INW,1988-04-10,W,M,110010043,BOULDER PD,42-2-101(1) DRIVING WITHOUT A VA,2011-08-09
3,"ARELLANO-ORDAZ,SIMON",1106625,2011-08-09 22:20:00,BJ INW,1988-04-10,W,M,110010043,BOULDER PD,42-4-203 DROVE DEFECTIVE/UNSA,2011-08-09
4,"BECK,WILLIAM FRANCIS",1106627,2011-08-09 23:51:00,BJ BOK,1948-09-21,W,M,11-1746,UNIVERSITY OF COLORADO,BOULDER MUNI FTA:IMPROP CARE ANAM,2011-08-09


In [5]:
print('''Out of the {} rows, there are {} unique charges.

The most common charges are:
{}
'''.format(df.shape[0], df.Charge.nunique(), df.Charge.value_counts().head()))

Out of the 434423 rows, there are 57203 unique charges.

The most common charges are:
42-4-1301(1)(A) DUI                   21779
42-4-1301(2)(A) DUI PER SE            12344
42  4    1301 DUI                     12238
18-3-204 THIRD DEGREE ASSAULT          9765
18-9-111 HARASSMENT                    8250
Name: Charge, dtype: int64



DUI shows up under more than one spelling (`42-4-1301(1)(A) DUI` and `42  4    1301 DUI`), so I wonder whether each department has its own standard for filling in the charges.
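One way to probe that question is to label each charge by how its statute number is punctuated and then count the labels per agency. The sketch below uses only the standard library; the sample rows and the `statute_style` helper are illustrative assumptions, not part of the original notebook or its data.

```python
import re
from collections import Counter

# Hypothetical sample rows of (agency, charge); not drawn from the real file.
sample_rows = [
    ("BOULDER PD", "42-4-1301(1)(A) DUI"),
    ("BOULDER PD", "42  4    1301 DUI"),
    ("LONGMONT PD", "42-4-1301(1)(A) DUI"),
    ("LONGMONT PD", "42-4-1301(2)(A) DUI PER SE"),
    ("UNIVERSITY OF COLORADO", "BOULDER MUNI FTA:IMPROP CARE ANAM"),
]

def statute_style(charge):
    """Label a charge by how its leading statute number is punctuated."""
    if re.match(r"\d+-\d+-\d+", charge):
        return "hyphenated"
    if re.match(r"\d+\s+\d+\s+\d+", charge):
        return "spaced"
    return "other"

# Count (agency, style) pairs to see whether a style clusters by agency.
style_counts = Counter(
    (agency, statute_style(charge)) for agency, charge in sample_rows
)

for (agency, style), n in sorted(style_counts.items()):
    print(agency, style, n)
```

On the real dataframe, something like `pd.crosstab(df['Arresting Agency'], df.Charge.apply(statute_style))` should give the same table in one step, assuming `Charge` has no missing values.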

In [65]:
df['date'] = pd.to_datetime(df['Arrest Date'])
df['year'] = df['date'].dt.year

In [6]:
df['Arresting Agency'].value_counts()

BOULDER COUNTY SHERIFFS OFFICE    152564
BOULDER PD                        118867
LONGMONT PD                        87293
LAFAYETTE PD                       25000
JAIL MITTS ONLY                    14828
UNIVERSITY OF COLORADO             12026
COLORADO STATE PATROL               9420
LOUISVILLE PD                       7843
ERIE PD                             2091
NEDERLAND MARSHALS OFFICE           1470
PAROLE                              1279
BOULDER COUNTY DRUG TASK FORCE      1214
OTHER                                288
DISTRICT ATTORNEYS OFFICE            122
STATE DIVISION OF WILDLIFE            82
WARD MARSHALS OFFICE                  24
COMMUNITY CORRECTIONS                 12
Name: Arresting Agency, dtype: int64

In [66]:
bcso_df = df[df['Arresting Agency'] == 'BOULDER COUNTY SHERIFFS OFFICE']
bpd_df = df[df['Arresting Agency'] == 'BOULDER PD']
longmont_pd = df[df['Arresting Agency'] == 'LONGMONT PD']

In [67]:
boulder_pds = ['BOULDER PD', 'UNIVERSITY OF COLORADO', 'OTHER', 'BOULDER COUNTY SHERIFFS OFFICE']

boulder_df = df[df['Arresting Agency'].isin(boulder_pds)]

In [68]:
# Note: `camping` is created in a later cell (In [51]); the notebook was run out of order.
boulder_df[boulder_df['Arresting Agency'] == 'BOULDER COUNTY SHERIFFS OFFICE'].camping.sum()

469

In [69]:
boulder_by_person = boulder_df.groupby('Booking No').first()
boulder_by_person['num_charges'] = boulder_df.groupby('Booking No').Name.count()

boulder_by_person.shape[0], (boulder_by_person['num_charges'] == 1).sum()

(120858, 52682)

In [None]:
boulder_df[boulder_df['Arresting Agency'] == 'BOULDER COUNTY SHERIFFS OFFICE']

In [70]:
boulder_by_person[(boulder_by_person['num_charges'] == 1) & (boulder_by_person.year >= 2009)].camping.sum()

443

In [60]:
# Note: `fta` and `camping` come from cells further down that set df['fta'] and df['camping'].
boulder_by_person[(boulder_by_person['num_charges'] == 1) & (~boulder_by_person.fta)].groupby('year').camping.sum()

year
1921     0.0
1929     0.0
1940     0.0
1982     0.0
1984     0.0
1985     0.0
1990     0.0
1991     0.0
1992     0.0
1993     0.0
1994     0.0
1995     0.0
1996     0.0
1997     0.0
1998     0.0
1999     0.0
2000    11.0
2001     8.0
2002    10.0
2003     4.0
2004    15.0
2005     9.0
2006    27.0
2007    36.0
2008    33.0
2009    33.0
2010    34.0
2011    34.0
2012    23.0
2013     8.0
2014     7.0
2015     4.0
2016     5.0
2017     1.0
Name: camping, dtype: float64

In [14]:
print('''Top charges from

Boulder County Sheriff's Office:
{}

Boulder PD
{}

Longmont PD
{}'''.format(
        bcso_df.Charge.value_counts().head(),
        bpd_df.Charge.value_counts().head(),
        longmont_pd.Charge.value_counts().head(),
    ))


Top charges from

Boulder County Sheriff's Office:
42-4-1301(1)(A) DUI                   7052
42  4    1301 DUI                     4976
42-4-1301(2)(A) DUI PER SE            3810
42-4-1301(1)(B) DWAI                  3622
16-19-111 WRIT OF HABEAS CORPU        3406
Name: Charge, dtype: int64

Boulder PD
42-4-1301(1)(A) DUI                   9013
42-4-1301(2)(A) DUI PER SE            5743
42  4    1301 DUI                     4459
18-3-204 THIRD DEGREE ASSAULT         2523
18-6-800.3 DOMESTIC VIOLENCE          2505
Name: Charge, dtype: int64

Longmont PD
18-3-204 THIRD DEGREE ASSAULT            2806
18-6-801 DOMESTIC VIOLENCE               2245
18-6-800.3 DOMESTIC VIOLENCE             2147
42-4-1301(1)(A) DUI                      2125
18-6-803.5(2)(A) VIOLATION OF A RESTR    2075
Name: Charge, dtype: int64


The top charges are fairly consistent across the agencies, except that DUI is sometimes written as `42-4-1301(1)(A) DUI` and sometimes as `42  4    1301 DUI`.
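If the two spellings needed to be compared directly, the space-separated form could be rewritten with hyphens. The following is a minimal sketch under that assumption; `normalize_statute` is a hypothetical helper, and the spaced form carries no subsection, so it still would not merge with `42-4-1301(1)(A) DUI` after normalizing.

```python
import re

# Matches a leading statute number whose parts are separated by whitespace,
# e.g. "42  4    1301". Anchored at the start so later digits are untouched.
leading_spaced_code = re.compile(r"^(\d+)\s+(\d+)\s+(\d+(?:\.\d+)?)")

def normalize_statute(charge):
    """Rewrite a space-separated leading statute number with hyphens."""
    return leading_spaced_code.sub(r"\1-\2-\3", charge)

print(normalize_statute("42  4    1301 DUI"))     # spaced form is rewritten
print(normalize_statute("42-4-1301(1)(A) DUI"))   # hyphenated form is unchanged
```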

Stripping out all of the non-alphabetical text, what are the most common terms? Or vice versa, what are the most common sections of the code that are cited?

In [7]:
import string

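all_caps = set(string.ascii_uppercase)
def preserve_capitalized_characters(s):
    """Return s starting at the first pair of adjacent capital letters.

    If no such pair exists, only the final character is returned.
    """
    i = 0  # keeps the return below safe when s is empty
    for i, c in enumerate(s):
        try:
            if c in all_caps and s[i+1] in all_caps:
                break
        except IndexError:
            # a capital in the last position has no neighbour to check
            break

    return s[i:].strip()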

df.Charge.head().apply(preserve_capitalized_characters)

0                 SALE/MFG/DIST/CONT S
1                          CHILD ABUSE
2                 DRIVING WITHOUT A VA
3                 DROVE DEFECTIVE/UNSA
4    BOULDER MUNI FTA:IMPROP CARE ANAM
Name: Charge, dtype: object

In [8]:
df['charge_text'] = df.Charge.apply(preserve_capitalized_characters)

In [9]:
print('''The most common charge texts are:

{}
'''.format(df.charge_text.value_counts().head(20)))

The most common charge texts are:

DUI                     35942
DUI PER SE              12451
DOMESTIC VIOLENCE       12265
HARASSMENT              10748
THIRD DEGREE ASSAULT     9785
DROVE VEHICLE WHEN L     8311
VIOLATION OF A RESTR     6904
FAILED TO PRESENT EV     5600
DWAI                     4989
LANE USAGE VIOLATION     4866
CARELESS DRIVING         4832
DRUG PARAPHERNALIA-P     4308
OBSTRUCTING A PEACE      4282
COMPULSARY INSURANCE     3912
THEFT                    3906
WRIT OF HABEAS CORPU     3455
TRESPASS FIRST DEGRE     3375
FORGERY                  3173
ARREST OF PROBATIONE     3047
ASSAULT IN THE 3RD D     3012
Name: charge_text, dtype: int64



In [10]:
import re

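# Three numeric parts, then an optional fourth. The lazy `\d+?` at the end of
# the pattern captures only the first digit of that fourth part.
code_regex = re.compile(r'(\d+)\D+(\d+)\D+(\d+)\D+(\d+?)?')
def extract_legal_code(s):
    match = code_regex.search(s)
    if match:
        return tuple([int(x) for x in match.groups() if x])
    return None  # no statute number found (e.g. municipal charges)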

df['legal_code_parts'] = df.Charge.apply(extract_legal_code)

In [11]:
print('''Out of {} bookings, there were {} identifiable legal codes. This accounted for {:.2f} percent of the bookings.

The most common parts of legal codes are:

{}'''.format(
        df.shape[0],
        len(df['legal_code_parts'].unique()),
        df['legal_code_parts'].count() / df.shape[0] * 100,
        df['legal_code_parts'].value_counts().head(20)
    ))

Out of 434423 bookings, there were 1213 identifiable legal codes. This accounted for 91.13 percent of the bookings.

The most common parts of legal codes are:

(42, 4, 1301, 1)    28346
(42, 4, 1301)       17678
(18, 1, 3, 7)       17031
(16, 2, 110, 1)     15101
(42, 4, 1301, 2)    12651
(42, 2, 138)        12132
(16, 2, 110, 2)     11768
(18, 9, 111)        11251
(18, 3, 204)         9880
(18, 4, 401, 2)      9623
(18, 6, 803, 5)      8920
(18, 6, 800, 3)      7964
(16, 11, 205)        7597
(16, 2, 110)         7507
(18, 18, 405, 2)     6655
(42, 4, 1007)        6246
(42, 4, 1409, 3)     5647
(42, 4, 1402)        4847
(16, 19, 111)        4553
(18, 18, 428, 1)     4410
Name: legal_code_parts, dtype: int64
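To see why `(42, 4, 1301)` and `(42, 4, 1301, 1)` are counted separately, it helps to run the notebook's pattern on a few strings. The sketch below reuses that pattern and compares it with a hypothetical variant whose fourth group is greedy; the `extract` helper and the made-up fourth sample are assumptions added for illustration.

```python
import re

# Same pattern as the notebook cell above.
code_regex = re.compile(r'(\d+)\D+(\d+)\D+(\d+)\D+(\d+?)?')
# Hypothetical variant with a greedy fourth group, for comparison only.
greedy_regex = re.compile(r'(\d+)\D+(\d+)\D+(\d+)\D+(\d+)?')

def extract(regex, s):
    """Return the captured numeric parts as a tuple of ints, or None."""
    match = regex.search(s)
    if match:
        return tuple(int(x) for x in match.groups() if x)
    return None

samples = [
    "42-4-1301(1)(A) DUI",
    "42  4    1301 DUI",                  # no subsection, so only three parts
    "18-6-800.3 DOMESTIC VIOLENCE",
    "18-4-401(12) MADE-UP EXAMPLE",       # hypothetical two-digit subsection
    "BOULDER MUNI FTA:IMPROP CARE ANAM",  # no statute number at all
]

results = {s: (extract(code_regex, s), extract(greedy_regex, s)) for s in samples}
for s, (lazy, greedy) in results.items():
    print(s, lazy, greedy)
```

The two patterns agree on the strings taken from the data and differ only on the made-up two-digit subsection, where the lazy group keeps just the first digit. Whether that matters here depends on how often multi-digit subsections occur in the real charges.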


In [70]:
df['legal_code_no_1'] = df['legal_code_parts'].str.get(0)
df['legal_code_no_2'] = df['legal_code_parts'].str.get(1)
df['legal_code_no_3'] = df['legal_code_parts'].str.get(2)
df['legal_code_no_4'] = df['legal_code_parts'].str.get(3)

In [12]:
df.head()

Unnamed: 0,Name,Booking No,Booked,Location,DOB,Race,Sex,Case No,Arresting Agency,Charge,Arrest Date,charge_text,legal_code_parts
0,"ARELLANO-ORDAZ,SIMON",1106625,2011-08-09 22:20:00,BJ INW,1988-04-10,W,M,110010043,BOULDER PD,18-18-405(2)(A)(I). SALE/MFG/DIST/CONT S,2011-08-09,SALE/MFG/DIST/CONT S,"(18, 18, 405, 2)"
1,"ARELLANO-ORDAZ,SIMON",1106625,2011-08-09 22:20:00,BJ INW,1988-04-10,W,M,110010043,BOULDER PD,18-6-401(7)(B)(I) CHILD ABUSE,2011-08-09,CHILD ABUSE,"(18, 6, 401, 7)"
2,"ARELLANO-ORDAZ,SIMON",1106625,2011-08-09 22:20:00,BJ INW,1988-04-10,W,M,110010043,BOULDER PD,42-2-101(1) DRIVING WITHOUT A VA,2011-08-09,DRIVING WITHOUT A VA,"(42, 2, 101, 1)"
3,"ARELLANO-ORDAZ,SIMON",1106625,2011-08-09 22:20:00,BJ INW,1988-04-10,W,M,110010043,BOULDER PD,42-4-203 DROVE DEFECTIVE/UNSA,2011-08-09,DROVE DEFECTIVE/UNSA,"(42, 4, 203)"
4,"BECK,WILLIAM FRANCIS",1106627,2011-08-09 23:51:00,BJ BOK,1948-09-21,W,M,11-1746,UNIVERSITY OF COLORADO,BOULDER MUNI FTA:IMPROP CARE ANAM,2011-08-09,BOULDER MUNI FTA:IMPROP CARE ANAM,


In [51]:
df['camping'] = df['charge_text'].str.contains('CAMP') | df['charge_text'].str.contains('5-6-10')
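A plain substring test for `CAMP` also matches words such as `CAMPUS`, which is why a later cell filters those out. A possible alternative, sketched here with the standard library, is a word-boundary pattern; `camp_word` and the third sample string are hypothetical, and other abbreviations of "camping" would not be caught by it.

```python
import re

# Match CAMP or CAMPING as a whole word, so CAMPUS is not flagged.
camp_word = re.compile(r'\bCAMP(?:ING)?\b')

charges = [
    "BOULDER MUNI CAMP W/O PERMISSION",
    "BOULDER MUNI CAMPING W/O CONSENT",
    "TRESPASS ON CAMPUS",  # hypothetical string, for contrast
]

flags = {c: bool(camp_word.search(c)) for c in charges}
for charge, is_camping in flags.items():
    print(charge, is_camping)
```

In pandas the same pattern could be passed to `df['charge_text'].str.contains(...)`, which treats its argument as a regular expression by default.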

In [85]:
df['boulder_muni'] = df['charge_text'].str.contains('BOULDER MUNI')

In [14]:
df['fta'] = df['charge_text'].str.contains('FTA')
df['ftc'] = df['charge_text'].str.contains('FTC')

In [86]:
df[df.boulder_muni]['Arresting Agency'].value_counts()

BOULDER PD                        15417
BOULDER COUNTY SHERIFFS OFFICE     3406
UNIVERSITY OF COLORADO              782
LONGMONT PD                         387
JAIL MITTS ONLY                     125
LAFAYETTE PD                        106
LOUISVILLE PD                        57
NEDERLAND MARSHALS OFFICE            49
COLORADO STATE PATROL                38
ERIE PD                              10
BOULDER COUNTY DRUG TASK FORCE        5
OTHER                                 3
PAROLE                                2
STATE DIVISION OF WILDLIFE            1
Name: Arresting Agency, dtype: int64

In [89]:
df[df.boulder_muni & ~df.fta & ~df.ftc].Charge.value_counts()

BOULDER MUNI TRESPASSING             1384
BOULDER MUNI TRESPASS                 597
BOULDER MUNI POSN,CONS                364
BOULDER MUNI USE OF FIGHTING WORD     359
BOULDER MUNI RESISTING ARREST         354
BOULDER MUNI POSS/CONS ALCOHOL        297
BOULDER MUNI OBSTRUCTION              275
BOULDER MUNI CAMP W/O PERMISSION      187
BOULDER MUNI POSS ALCOHOL PUBLIC      176
BOULDER MUNI USE OF FIGHT WORDS       169
BOULDER MUNI CONTEMPT OF COURT        165
BOULDER MUNI PHYSICAL HARASSMENT      132
BOULDER MUNI BRAWLING                 129
BOULDER MUNI URINATING IN PUBLIC      120
BOULDER MUNI POSS/CONS OF ALC         109
BOULDER MUNI POSS/CONS OF ALCOHOL      96
BOULDER MUNI OBSTRUCTING POLICE        95
BOULDER MUNI CAMPING W/O CONSENT       93
BOULDER MUNI LITTERING                 90
BOULDER MUNI USE OF FIGHTING WRDS      85
BOULDER MUNI THREAT BODILY INJURY      85
BOULDER MUNI 3RD DEGREE ASSAULT        85
BOULDER MUNI OBSTRUCTING               73
BOULDER MUNI SMOKING PROHIBITED   

In [98]:
df[df.camping & ~df.fta & ~df.ftc & ~df.Charge.str.contains("CAMPUS")].Charge.value_counts()

1095

In [71]:
df.to_csv('../data/all-bookings-with-charges.csv', index=False)