# Joining Boulder jail bookings data

We have downloaded 16 years of jail rolls from the Boulder County Sheriff's [website](https://www.bouldercounty.org/safety/jail/listing-and-booking-reports/). However, each day's records are stored in a separate Excel spreadsheet. The goal of this notebook is to read every spreadsheet and combine them into a single data set, saved as one CSV file.

In [16]:
import pandas as pd

example_spreadsheet = pd.read_excel('../data/raw/2017-07-07.xlsx')

In [6]:
print('This spreadsheet has {} rows.'.format(example_spreadsheet.shape[0]))

example_spreadsheet.head(3)

This spreadsheet has 31 rows.


| Unnamed: 0 | Name | Booking No | Booked | Location | DOB | Race | Sex | Case No | Arresting Agency | Charge | Arrest Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AGUIRRE,ANTONIO MIGUEL | 1705254 | 2017-07-07 21:09:00 | BJ INW | 1994-12-23 | W | M | 201702044 | LAFAYETTE PD | 16-19-103 FUGITIVES FROM JUSTI | 2017-07-07 |
| 1 | ALI,HASSAN | 1705237 | 2017-07-07 01:03:00 | BJ MED | 1988-09-12 | W | M | 170008337 | BOULDER PD | 18-3-206(1)(A)(B) FELONY MENACING-REAL | 2017-06-06 |
| 2 | ALI,HASSAN | 1705237 | 2017-07-07 01:03:00 | BJ MED | 1988-09-12 | W | M | 170008337 | BOULDER PD | 18-4-501........... CRIMINAL MISCHIEF $1 | 2017-06-06 |


In [17]:
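import os

dataframes = []

# Read each daily spreadsheet in the raw data directory into its own data frame.
for i, d in enumerate(os.scandir('../data/raw')):
    # Skip anything that is not an Excel file.
    if not d.path.endswith('.xlsx'):
        continue

    df = pd.read_excel(d.path)
    dataframes.append(df)

    # Print a progress dot roughly every 100 directory entries.
    if i % 100 == 0:
        print('.', end='')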

................................................................
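
The loop above reads files in whatever order `os.scandir` returns them and does not record which file each row came from. A minimal sketch of an alternative is below. The helper name `load_daily_rolls`, the `reader` parameter, and the extra `roll_date` column are all hypothetical, and the sketch assumes every file is named like the example, `YYYY-MM-DD.xlsx`.

```python
from pathlib import Path

import pandas as pd


def load_daily_rolls(raw_dir, reader=pd.read_excel):
    """Read every .xlsx file in raw_dir and stack them into one DataFrame."""
    frames = []
    # Sorting by filename gives date order, assuming YYYY-MM-DD.xlsx names.
    for path in sorted(Path(raw_dir).glob('*.xlsx')):
        frame = reader(path)
        # Hypothetical extra column: the date of the daily roll the row came from.
        frame['roll_date'] = pd.to_datetime(path.stem, format='%Y-%m-%d')
        frames.append(frame)
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

Calling `load_daily_rolls('../data/raw')` would then stand in for both the loop and the later `pd.concat` call. The `reader` argument exists only so the function can be tried out without real spreadsheets.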

In [18]:
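# Stack the daily frames; ignore_index avoids repeating each file's own row labels.
df = pd.concat(dataframes, ignore_index=True)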

In [19]:
print('''The combined data frame has {} rows.

There are:
{} unique bookings
{} unique case numbers
{} unique arrest dates.
{} unique name + birthdays
{} locations
{} arresting agencies'''.format(
    df.shape[0],
    df['Booking No'].nunique(),
    df['Case No'].nunique(),
    df['Arrest Date'].nunique(),
    (df['Name'] + df['DOB'].apply(str)).nunique(),
    df['Location'].nunique(),
    df['Arresting Agency'].nunique()
))

The combined data frame has 434423 rows.

There are:
167633 unique bookings
146528 unique case numbers
6690 unique arrest dates.
95495 unique name + birthdays
70 locations
17 arresting agencies
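
The combined frame has far more rows than unique bookings (434423 versus 167633). Part of that is expected: the example spreadsheet lists one row per charge, so a booking with two charges appears twice. A minimal sketch for checking whether any rows are exact repeats is below. The helper name `summarize_duplicates` is hypothetical, and ignoring `Unnamed: 0` on the grounds that it looks like a per-file row counter is an assumption based on the example output.

```python
import pandas as pd


def summarize_duplicates(bookings):
    """Return (row count, exact duplicate rows, mean rows per booking)."""
    # Assumption: 'Unnamed: 0' is only a row counter, so drop it before comparing rows.
    data = bookings.drop(columns=['Unnamed: 0'], errors='ignore')
    n_rows = len(data)
    # duplicated() flags every repeat of an earlier, fully identical row.
    n_exact_duplicates = int(data.duplicated().sum())
    # Average number of rows that share one booking number.
    rows_per_booking = data.groupby('Booking No').size().mean()
    return n_rows, n_exact_duplicates, rows_per_booking
```

Running `summarize_duplicates(df)` on the combined frame would show how many rows could be dropped with `drop_duplicates` without losing any charge.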


In [20]:
df.to_csv('../data/all-bookings.csv', index=False)
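
CSV files do not store column types, so the date columns come back as plain strings when the combined file is loaded later. A minimal sketch of reading it back with the dates parsed is below. `read_bookings` is a hypothetical helper, and the column names are taken from the example spreadsheet shown earlier.

```python
import pandas as pd


def read_bookings(csv_path):
    """Load the combined bookings CSV with its date columns parsed."""
    # Column names assumed to match the example spreadsheet.
    return pd.read_csv(csv_path, parse_dates=['Booked', 'DOB', 'Arrest Date'])
```

For the file written above, the call would be `read_bookings('../data/all-bookings.csv')`.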