#Data Loading, Cleaning, and Normalization
We need to load the data from .csv into Postgres.  We also need to normalize the data to make analysis easy.  We'll use Pandas to deal with the .csv loading and data storage.

Files we need to load:
- cfs_2014_inmain.csv (CFS data)
- cfs_xxx2014_incilog.csv (CFS event data; one file per month, where xxx is the month abbreviation, e.g. mar)
- cfs_2014_lwmain.csv (incident data)
- cfs_2014_lwmodop.csv (incident modus operandi data)
- LWMAIN.THING.csv (incident lookup tables, where THING is one of the following: CSSTATUS, EMDIVISION, EMSECTION, EMUNIT, INSTSTATS, PREMISE, or WEAPON)

Columns in files:
*inmain*

inci_id	calltime	calldow	case_id	callsource	primeunit	firstdisp	streetno	streetonly	street	citydesc	zip	crossroad1	crossroad2	geox	geoy	service	agency	statbeat	district	ra	business	naturecode	nature	priority	rptonly	cancelled	notes	timeroute	secs2rt	timefini	secs2fn	firstdtm	secs2di	secsrt2dsp	secsfi2dsp	firstenr	secs2en	secsdi2en	firstarrv	secs2ar	secsdi2ar	firsttran	secs2tr	secsar2tr	lastclr	secs2lc	secsar2lc	secstr2lc	timeclose	reptaken	closecode	closecomm

In [3]:
import pandas as pd
from sqlalchemy import create_engine # database connection
import datetime as dt
from IPython.display import display

In [10]:
display(pd.read_csv('csv_data/cfs_2014_inmain.csv', nrows=2).head())

Unnamed: 0,inci_id,calltime,calldow,case_id,callsource,primeunit,firstdisp,streetno,streetonly,street,...,secs2tr,secsar2tr,lastclr,secs2lc,secsar2lc,secstr2lc,timeclose,reptaken,closecode,closecomm
0,2014000002,1/1/14 0:00:22,4,,PHONE,BK2,BK2,301,S ELM ST,301 S ELM ST,...,0,0,1/1/14 0:04:20,238,0,0,1/1/14 0:04:22,,10,
1,2014000003,1/1/14 0:00:40,4,14000001.0,SELF,B200,B200,1610,GUESS RD,1610 GUESS RD,...,0,0,1/1/14 0:15:57,918,917,0,1/1/14 0:15:59,B200,1,


In [11]:
engine = create_engine('postgresql://localhost/cfs')

##cfs_2014_inmain.csv

In [None]:
start = dt.datetime.now()
chunksize = 20000
j = 0
index_start = 1

# Columns to keep; everything else is dropped before writing to the database
columns = ['Agency', 'CreatedDate', 'ClosedDate', 'ComplaintType', 'Descriptor',
           'TimeToCompletion', 'City']

# TODO: the file name and column names below still point at 311_100M.csv,
# not at cfs_2014_inmain.csv
for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):

    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns}) # Remove spaces from columns

    df['CreatedDate'] = pd.to_datetime(df['CreatedDate']) # Convert to datetimes
    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])

    df.index += index_start  # Continue the index from the previous chunk

    # Remove the un-interesting columns
    df = df[[c for c in df.columns if c in columns]]

    j += 1
    print('{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j*chunksize))

    df.to_sql('data', engine, if_exists='append')  # engine is the Postgres connection created above
    index_start = df.index[-1] + 1

#Initial Exploration

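This is an initial exploration of the Durham PD CFS data using quick, non-robust .csv reading code: it splits each line on commas and does not handle quoted fields.  The files have Windows line endings, so we open them in universal newline mode ("rU") to account for that.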

In [44]:
from pprint import pprint

first = True
incilog_header = ""
incilog = []

with open("cfs_mar2015_incilog.csv","rU") as f:
    for line in f.readlines():
        if first:
            incilog_header = line
            first = False
        else:
            incilog.append([datum.strip() for datum in line.split(',')])

In [50]:
pprint(incilog[0])

['63260886',
 'RPTO',
 'Report Only',
 '3/27/15 15:22:41',
 '55361',
 '2014412231',
 'B125',
 'R',
 '997150',
 '']
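
Splitting each line on commas works for the row printed here, but it would mis-split any quoted field that contains a comma. A minimal sketch of a sturdier reader using the standard library's csv module, written for Python 3; `read_rows` and the sample data are made up for illustration and are not part of the CFS files:

```python
import csv
import io

def read_rows(f):
    """Return (header, rows) from an open CSV file, stripping whitespace from every field."""
    reader = csv.reader(f)
    header = [name.strip() for name in next(reader)]
    rows = [[datum.strip() for datum in row] for row in reader]
    return header, rows

# Made-up sample with Windows line endings and a comma inside a quoted field
sample = io.StringIO(
    'inci_id,nature,notes\r\n'
    '1,ASSIST PERSON,"caller hung up, no answer on callback"\r\n'
)

header, rows = read_rows(sample)
print(header)
print(rows[0])
```

For a real file, opening it with `open(path, newline="")` lets the csv module deal with the Windows line endings itself.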


In [6]:
first = True
inmain_header = ""
inmain = []

with open("cfs_mar2015_inmain.csv","rU") as f:
    for line in f.readlines():
        if first:
            inmain_header = line
            first = False
        else:
            inmain.append([datum.strip() for datum in line.split(',')])

In [49]:
pprint(inmain[0])

['2015087068',
 '3/1/15 0:00:32',
 '1',
 '',
 'E911',
 'C413',
 'C424',
 '617',
 'HOPE AVE',
 '617 HOPE AVE',
 'DURHAM',
 '27707',
 'ANACOSTA ST',
 'LINCOLN ST',
 '2030390.25',
 '807470.19',
 'LAW',
 'DPD',
 '412',
 'D4',
 'STH',
 '',
 'ASSIST',
 'ASSIST PERSON',
 '4',
 '0',
 '0',
 'actve dist...child advised mom and aunt aruging  [03/01/15 00:01:14 SMITHK]  WRLS  [03/01/15 00:01:19 SMITHK]  NO PHASE 2.....EHX SHOWS 500 MAHONE POSS APT1  [03/01/15 00:04:09 SMITHK]  [EPD] Aborted by Law Priority with code: 1. Caller hung up  [03/01/15 00:07:42 SMITHK]  {C413} NEED BETTER LOCATION  [03/01/15 00:09:50 ROSSA]',
 '3/1/15 0:04:11',
 '219',
 '3/1/15 0:08:11',
 '459',
 '3/1/15 0:04:53',
 '261',
 '42',
 '0',
 '3/1/15 0:04:53',
 '261',
 '0',
 '3/1/15 0:09:32',
 '540',
 '279',
 'NULL',
 '0',
 '0',
 '3/1/15 0:34:42',
 '2050',
 '1510',
 '0',
 '3/1/15 0:34:43',
 '',
 '10',
 '']
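
The row above mixes empty strings, the literal string 'NULL', and timestamps in m/d/yy format. A minimal sketch of how a raw row might be cleaned before loading, written for Python 3. The names `clean_row`, `MISSING`, and `TIMESTAMP_COLUMNS` are hypothetical, and treating both '' and 'NULL' as missing is an assumption based on this one row:

```python
import datetime

TIMESTAMP_FORMAT = "%m/%d/%y %H:%M:%S"
# inmain columns that hold timestamps in the row printed above
TIMESTAMP_COLUMNS = {"calltime", "timeroute", "timefini", "firstdtm", "firstenr",
                     "firstarrv", "firsttran", "lastclr", "timeclose"}
# Assumption: both of these mean "no value"
MISSING = {"", "NULL"}

def clean_row(header, row):
    """Pair column names with values, map missing markers to None, and parse timestamps."""
    cleaned = {}
    for name, value in zip(header, row):
        if value in MISSING:
            cleaned[name] = None
        elif name in TIMESTAMP_COLUMNS:
            cleaned[name] = datetime.datetime.strptime(value, TIMESTAMP_FORMAT)
        else:
            cleaned[name] = value
    return cleaned

# A few columns and values taken from the row above
header = ["inci_id", "calltime", "case_id", "firsttran", "timeclose"]
row = ["2015087068", "3/1/15 0:00:32", "", "NULL", "3/1/15 0:34:43"]

record = clean_row(header, row)
print(record["calltime"])
print(record["firsttran"])
```

Numeric columns are left as strings here; converting them would be a separate step once we know which ones can be empty.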


The dispatcher's remarks are all concatenated together, separated by brackets containing what appear to be timestamps and names.  We'll use regexes to pull these apart.

In [51]:
import re

timestamp_expr = re.compile(r"\[(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.+?)\]")  # raw string so the backslashes reach the regex engine unchanged

test_str = "actve dist...child advised mom and aunt aruging  [03/01/15 00:01:14 SMITHK]  \
WRLS  [03/01/15 00:01:19 SMITHK]  \
NO PHASE 2.....EHX SHOWS 500 MAHONE POSS APT1  [03/01/15 00:04:09 SMITHK]  \
[EPD] Aborted by Law Priority with code: 1. Caller hung up  [03/01/15 00:07:42 SMITHK]  \
{C413} NEED BETTER LOCATION  [03/01/15 00:09:50 ROSSA]"

pprint(timestamp_expr.split(test_str))

['actve dist...child advised mom and aunt aruging  ',
 '03/01/15 00:01:14',
 'SMITHK',
 '  WRLS  ',
 '03/01/15 00:01:19',
 'SMITHK',
 '  NO PHASE 2.....EHX SHOWS 500 MAHONE POSS APT1  ',
 '03/01/15 00:04:09',
 'SMITHK',
 '  [EPD] Aborted by Law Priority with code: 1. Caller hung up  ',
 '03/01/15 00:07:42',
 'SMITHK',
 '  {C413} NEED BETTER LOCATION  ',
 '03/01/15 00:09:50',
 'ROSSA',
 '']


This is a function we can use to get the data for each individual note.

In [52]:
import datetime

def split_notes(notes):
    """
    Return a list of 3-tuples.  Each tuple represents a single note and contains the text of the note, the
    timestamp, and the note-taker, in that order.
    """
    tuples = []
    # Splitting on a regex with two groups yields [text, timestamp, name, text, timestamp, name, ..., ''];
    # drop the trailing empty string so the list divides evenly into groups of three
    regex_split = timestamp_expr.split(notes)[:-1]
    for i in range(0, len(regex_split), 3):
        note = regex_split[i].strip()
        timestamp = datetime.datetime.strptime(regex_split[i+1], "%m/%d/%y %H:%M:%S")
        notetaker = regex_split[i+2]
        tuples.append((note, timestamp, notetaker))
    return tuples

pprint(split_notes(test_str))

[('actve dist...child advised mom and aunt aruging',
  datetime.datetime(2015, 3, 1, 0, 1, 14),
  'SMITHK'),
 ('WRLS', datetime.datetime(2015, 3, 1, 0, 1, 19), 'SMITHK'),
 ('NO PHASE 2.....EHX SHOWS 500 MAHONE POSS APT1',
  datetime.datetime(2015, 3, 1, 0, 4, 9),
  'SMITHK'),
 ('[EPD] Aborted by Law Priority with code: 1. Caller hung up',
  datetime.datetime(2015, 3, 1, 0, 7, 42),
  'SMITHK'),
 ('{C413} NEED BETTER LOCATION',
  datetime.datetime(2015, 3, 1, 0, 9, 50),
  'ROSSA')]
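
Once the notes are split, each one could become its own row keyed by the incident it belongs to. A rough sketch of that normalization step, written for Python 3, with an in-memory SQLite database standing in for Postgres. The `note` table, its column names, and the `note_rows` helper are hypothetical, not a schema we have settled on:

```python
import datetime
import re
import sqlite3

timestamp_expr = re.compile(r"\[(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.+?)\]")

def note_rows(inci_id, notes):
    """Yield (inci_id, timestamp, notetaker, note text) for each bracketed stamp in notes."""
    pieces = timestamp_expr.split(notes)
    # pieces is [text, stamp, name, text, stamp, name, ..., trailing text]
    for i in range(0, len(pieces) - 1, 3):
        stamp = datetime.datetime.strptime(pieces[i + 1], "%m/%d/%y %H:%M:%S")
        # Store the timestamp as ISO text so it sorts correctly in SQLite
        yield (inci_id, stamp.isoformat(" "), pieces[i + 2], pieces[i].strip())

# Two of the notes from the test string above
notes = ("WRLS  [03/01/15 00:01:19 SMITHK]  "
         "{C413} NEED BETTER LOCATION  [03/01/15 00:09:50 ROSSA]")

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
conn.execute("""
    CREATE TABLE note (
        inci_id INTEGER,
        created TIMESTAMP,
        notetaker VARCHAR,
        body VARCHAR
    )
""")
conn.executemany("INSERT INTO note VALUES (?, ?, ?, ?)", note_rows(2015087068, notes))

stored = conn.execute(
    "SELECT inci_id, created, notetaker, body FROM note ORDER BY created"
).fetchall()
for stored_row in stored:
    print(stored_row)
```

Keeping notes in their own table would let us query by note-taker or time without re-parsing the notes column.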


Questions we need answered about some of the fields:

inmain
- can we get any more info about the cases from the case_id? (case_id: case number, if a report is generated from the call)
- callsources: E911 and ALARM are self-explanatory, but what do SELF, PHONE, and RADIO mean?
- primeunit: what are the responsibilities of the prime unit?
- service is always LAW, agency is always DPD
- nature/naturecode: differences between HANG UP, HANG UP WIRELESS PHASE 1, and HANG UP WIRELESS PHASE 2?
- notes: need abbreviations used, can maybe get some of them from the nature codes
- meanings of closecodes?

incilog
- each unit = one officer? any additional info we can get from unitper table, such as officer pay to more accurately estimate cost?

lwmain.csstatus
- which code (code_fbi, code_sbi, code_agcy) is the one corresponding to the csstatus foreign key? (assuming code_agcy for all of the lookup tables, since that matches up best with the data)
- are any columns other than descriptn informative?

same questions for lwmain.emdivision, emsection, emunit, invststats, premise, weapon
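
One way to check which code column "matches up best with the data" is to measure what fraction of the foreign key values appear in each candidate column. A small sketch of that check, written for Python 3; the `match_rate` helper and all of the values below are made up for illustration and do not come from the real lookup tables:

```python
def match_rate(values, codes):
    """Fraction of non-empty values that appear in the candidate code column."""
    present = [v for v in values if v != ""]
    if not present:
        return 0.0
    known = set(codes)
    return sum(1 for v in present if v in known) / len(present)

# Made-up stand-ins: csstatus values from lwmain and the three code columns of the lookup table
csstatus_values = ["1", "3", "3", "", "7", "1"]
lookup = {
    "code_fbi": ["A", "B", "C"],
    "code_sbi": ["01", "03", "07"],
    "code_agcy": ["1", "3", "7"],
}

rates = {column: match_rate(csstatus_values, codes) for column, codes in lookup.items()}
best = max(rates, key=rates.get)
print(rates)
print(best)
```

A column whose rate is close to 1.0 for every lookup table would support the code_agcy assumption; a low rate everywhere would suggest the keys need cleaning first.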

Eventually, we'll create the database schema here to store the CFS data in a more structured way.

In [None]:
"""
# I think we're actually going to use postgres -- maybe not worry about the specific db implementation for now

import sqlite3
conn = sqlite3.connect('dpd_cfs.db')
c = conn.cursor()

CREATE_INCIDENT = \
\"""
CREATE TABLE IF NOT EXISTS incident (
    inci_id INTEGER PRIMARY KEY,
    calltime TIMESTAMP,
    calldow INTEGER,
    case_id INTEGER,
    callsource VARCHAR,
    primeunit VARCHAR,
    firstdisp VARCHAR,
    streetno INTEGER,
    streetonly VARCHAR,
    street VARCHAR,
    citydesc VARCHAR,
    zip INTEGER,
    crossroad1 VARCHAR,
    crossroad2 VARCHAR,
    geox DOUBLE,
    geoy DOUBLE,
    service VARCHAR,
    agency VARCHAR
    -- remaining inmain columns still to be added
    )
\"""

c.execute(CREATE_INCIDENT)
"""