#Data Loading, Cleaning, and Normalization
Now that we have a better idea of what the data contains, we're going to load it in a format that will be more efficient for analysis.  Changes to make:

 - account for 4 digit years in the note-splitting regex
 - account for null author in the note-splitting regex (ex. see caller at hq at the desk  [08/01/14 15:58:37 WEAVERM]  [EPD] Aborted by Law Priority with code: 1. Caller Hung Up  [08/01/14 16:11:23 ] ) and null text (ex.  old oxford  dearborn and roxboro needs salt and sand trucks  [03/03/14 22:33:07 HOLLANDJ]  [03/03/14 22:34:30 HOLLANDJ])
 - plural table names for ORM
 - turn all blanks into null
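
The regex changes listed above could look roughly like the sketch below. It is a sketch only, not the notebook's final pattern: `NOTE_STAMP`, `parse_stamp`, and `split_notes_sketch` are hypothetical names, and it assumes the timestamp is always `MM/DD/YY` or `MM/DD/YYYY` followed by `HH:MM:SS`. The sample string is the null-author example quoted in the list.

```python
import datetime as dt
import re

# Hypothetical pattern (not the notebook's final regex): the year may have 2 or 4
# digits, and the author after the timestamp may be empty.
NOTE_STAMP = re.compile(
    r"\[(\d{2}/\d{2}/(?:\d{4}|\d{2}) \d{2}:\d{2}:\d{2}) ?([^\]]*)\]"
)

def parse_stamp(raw):
    """Parse a timestamp whose date part is MM/DD/YY or MM/DD/YYYY."""
    date_part = raw.split(" ")[0]
    fmt = "%m/%d/%Y %H:%M:%S" if len(date_part) == 10 else "%m/%d/%y %H:%M:%S"
    return dt.datetime.strptime(raw, fmt)

def split_notes_sketch(notes):
    """Return (text, timestamp, author) tuples; blank text or author becomes None."""
    parts = NOTE_STAMP.split(notes)
    # parts is [text, stamp, author, text, stamp, author, ..., trailing text]
    result = []
    for i in range(0, len(parts) - 1, 3):
        text = parts[i].strip() or None
        author = parts[i + 2].strip() or None
        result.append((text, parse_stamp(parts[i + 1]), author))
    return result

sample = ("see caller at hq at the desk  [08/01/14 15:58:37 WEAVERM]  "
          "[EPD] Aborted by Law Priority with code: 1. Caller Hung Up  [08/01/14 16:11:23 ] ")
notes = split_notes_sketch(sample)
```

Bracketed text that is not a timestamp, such as `[EPD]`, does not match the pattern and stays in the note body.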
 
##call
 - split call.call_source into new table
 - split call.primary_unit, call.reporting_unit into new table (get list of all units from call_log + '2091')
 - filter out call_time < 2014 (that one weird row from 2007)
 - call.first_dispatched refers to new unit table
 - can create list of streets from call; split call.street_name, call.crossroad1, call.crossroad2, incident.street_name into new table (get list of all streets from call.street_name)
 - split city_desc into new table
 - drop service, agency
 - ditch nature code, split nature_desc into own table
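
As a small illustration of the date filter in the list above, the sketch below drops rows received before 2014. The rows are invented and `CUTOFF` is a hypothetical name; the real filter would run on the parsed call time of each chunk.

```python
import datetime as dt

CUTOFF = dt.datetime(2014, 1, 1)  # assumed boundary: keep calls from 2014 onward

# Invented sample rows; the real data comes from the inmain CSV
calls = [
    {"call_id": 1, "call_time": dt.datetime(2007, 5, 4, 12, 0)},
    {"call_id": 2, "call_time": dt.datetime(2014, 1, 1, 0, 0)},
    {"call_id": 3, "call_time": dt.datetime(2014, 8, 1, 15, 58)},
]

kept = [c for c in calls if c["call_time"] >= CUTOFF]
dropped = [c["call_id"] for c in calls if c["call_time"] < CUTOFF]
```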

##call_log
 - get full transaction_descs, split them out?  too many contain unit name now, code<->desc is not 1:1
 - ditch unitper_id?
 
##note
 - split author out into own table
 
##incident
 - link city to new city_desc table
 - convert premise_code, weapon_code, bureau_code, division_code, unit_code, investigation_status_code, and case_status_code to int
 - drop lwchrgid, charge_seq
 
We'll load each table in two steps.  First, we scan the source file and accumulate the set of distinct values for each associated lookup table, then load those lookup tables.  Second, we load the main table itself.  This should be less complicated than trying to accumulate the lookup tables during the chunked load of the main table.
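
A minimal sketch of this two-pass idea, using only the standard library so it runs without a database. The helper names (`iter_chunks`, `build_lookup`, `load_main`), the sample data, and the tiny chunk size are all invented for illustration; the notebook itself does this with pandas and Postgres.

```python
import csv
import io
from itertools import islice

def iter_chunks(rows, size):
    """Yield lists of at most `size` rows from an iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk

def build_lookup(make_reader, column, size=2):
    """Pass 1: collect the distinct non-blank values of `column` and assign ids."""
    seen = set()
    for chunk in iter_chunks(make_reader(), size):
        seen.update(r[column].strip() for r in chunk if r[column].strip())
    return {descr: i for i, descr in enumerate(sorted(seen), start=1)}

def load_main(make_reader, column, lookup, size=2):
    """Pass 2: replace the description with its lookup id (None for blanks)."""
    out = []
    for chunk in iter_chunks(make_reader(), size):
        for r in chunk:
            row = dict(r)
            row[column + "_id"] = lookup.get(row.pop(column).strip())
            out.append(row)
    return out

# Invented sample data standing in for a source CSV
SAMPLE = "inci_id,citydesc\n1,BETA\n2, \n3,ALPHA\n4,BETA \n"

def make_reader():
    return csv.DictReader(io.StringIO(SAMPLE))

city_ids = build_lookup(make_reader, "citydesc")
calls = load_main(make_reader, "citydesc", city_ids)
```

Because the lookup is complete before the second pass starts, every row in the main table can be translated in a single step, whichever chunk it falls in.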

Main table: call
Lookup tables: call_unit, city, nature
Lookup tables loaded separately: call_source

Main table: note
Lookup tables: note_author

Main table: call_log
Lookup tables: transaction
Lookup tables loaded separately: close_code

Main table: incident
Lookup tables: city (should already have everything from call), ucr_descr
Lookup tables loaded separately: premise, weapon, bureau, division, unit, investigation_status, case_status

Main table: modus_operandi
Lookup tables: mo_item

We'll use dataset to stuff the data into a local instance of postgres.

In [40]:
import dataset
import datetime as dt
import pandas as pd
from sqlalchemy import create_engine

We need to create the tables before touching the data so they have all the proper constraints.
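
To illustrate why the constraints matter, here is a small sketch using Python's built-in `sqlite3` as a stand-in for Postgres (an assumption made only so the example is self-contained). With the foreign key declared up front, a row that points at a missing lookup id is rejected at insert time instead of silently loading.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.execute("CREATE TABLE city (city_id INTEGER PRIMARY KEY, descr TEXT)")
conn.execute("""
    CREATE TABLE call (
        call_id INTEGER PRIMARY KEY,
        city_id INTEGER REFERENCES city (city_id)
    )
""")

conn.execute("INSERT INTO city (descr) VALUES ('ALPHA')")  # invented sample value
conn.execute("INSERT INTO call VALUES (1, 1)")     # known city: accepted
conn.execute("INSERT INTO call VALUES (2, NULL)")  # NULL foreign key: accepted

try:
    conn.execute("INSERT INTO call VALUES (3, 99)")  # unknown city: rejected
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```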

#Database DDL

Code to create the database schema is below.

In [41]:
# CHANGE CREDENTIALS AS APPROPRIATE
db = dataset.connect('postgresql://jnance@localhost:5432/cfs')
engine = create_engine('postgresql://jnance@localhost:5432/cfs')

In [110]:
def reset_db():
    """
    Remove and recreate tables to prepare for reloading the db
    """
    db.query("DROP TABLE IF EXISTS note CASCADE;")
    db.query("DROP TABLE IF EXISTS note_author CASCADE;")
    db.query("DROP TABLE IF EXISTS call CASCADE;")
    db.query("DROP TABLE IF EXISTS call_source CASCADE;")
    db.query("DROP TABLE IF EXISTS call_unit CASCADE;")
    db.query("DROP TABLE IF EXISTS city CASCADE;")
    db.query("DROP TABLE IF EXISTS call_log CASCADE;")
    db.query("DROP TABLE IF EXISTS transaction CASCADE;")
    db.query("DROP TABLE IF EXISTS close_code CASCADE;")
    db.query("DROP TABLE IF EXISTS ucr_descr CASCADE;")
    db.query("DROP TABLE IF EXISTS incident CASCADE;")
    db.query("DROP TABLE IF EXISTS modus_operandi CASCADE;")
    db.query("DROP TABLE IF EXISTS mo_item CASCADE;")
    db.query("DROP TABLE IF EXISTS bureau CASCADE;")
    db.query("DROP TABLE IF EXISTS case_status CASCADE;")
    db.query("DROP TABLE IF EXISTS division CASCADE;")
    db.query("DROP TABLE IF EXISTS unit CASCADE;")
    db.query("DROP TABLE IF EXISTS investigation_status CASCADE;")
    db.query("DROP TABLE IF EXISTS weapon CASCADE;")
    db.query("DROP TABLE IF EXISTS weapon_group CASCADE;")
    db.query("DROP TABLE IF EXISTS premise CASCADE;")
    db.query("DROP TABLE IF EXISTS premise_group CASCADE;")
    db.query("DROP TABLE IF EXISTS nature CASCADE;")

    
    db.query("""
    CREATE TABLE ucr_descr
    (
      ucr_descr_id serial NOT NULL,
      short_descr text,
      long_descr text,
      CONSTRAINT ucr_descr_pk PRIMARY KEY (ucr_descr_id)
    );
    """)
    
    db.query("""
    CREATE TABLE bureau
    (
      bureau_id serial NOT NULL,
      descr text,
      CONSTRAINT bureau_pk PRIMARY KEY (bureau_id)
    );
    """)
    
    db.query("""
    CREATE TABLE division
    (
      division_id serial NOT NULL,
      descr text,
      CONSTRAINT division_pk PRIMARY KEY (division_id)
    );
    """)
    
    db.query("""
    CREATE TABLE investigation_status
    (
      investigation_status_id serial NOT NULL,
      descr text,
      CONSTRAINT investigation_status_pk PRIMARY KEY (investigation_status_id)
    );
    """)
    
    db.query("""
    CREATE TABLE case_status
    (
      case_status_id serial NOT NULL,
      descr text,
      CONSTRAINT case_status_pk PRIMARY KEY (case_status_id)
    );
    """)
    
    db.query("""
    CREATE TABLE unit
    (
      unit_id serial NOT NULL,
      descr text,
      CONSTRAINT unit_pk PRIMARY KEY (unit_id)
    );
    """)
    
    db.query("""
    CREATE TABLE weapon_group
    (
      weapon_group_id serial NOT NULL,
      descr text,
      CONSTRAINT weapon_group_pk PRIMARY KEY (weapon_group_id)
    );
    """)
    
    db.query("""
    CREATE TABLE premise_group
    (
      premise_group_id serial NOT NULL,
      descr text,
      CONSTRAINT premise_group_pk PRIMARY KEY (premise_group_id)
    );
    """)
    
    db.query("""
    CREATE TABLE weapon
    (
      weapon_id serial NOT NULL,
      descr text,
      weapon_group_id int,
      CONSTRAINT weapon_pk PRIMARY KEY (weapon_id),
      CONSTRAINT weapon_group_weapon_fk FOREIGN KEY (weapon_group_id) REFERENCES weapon_group (weapon_group_id)
    );
    """)
    
    db.query("""
    CREATE TABLE premise
    (
      premise_id serial NOT NULL,
      descr text,
      premise_group_id int,
      CONSTRAINT premise_pk PRIMARY KEY (premise_id),
      CONSTRAINT premise_group_premise_fk FOREIGN KEY (premise_group_id) REFERENCES premise_group (premise_group_id)
    );
    """)
    
    db.query("""
    CREATE TABLE city
    (
      city_id serial NOT NULL,
      descr text,
      CONSTRAINT city_pk PRIMARY KEY (city_id)
    );
    """)
    
    db.query("""
    CREATE TABLE incident
    (
      incident_id bigint NOT NULL,
      case_id bigint UNIQUE,
      time_filed timestamp without time zone,
      month_filed int,
      week_filed int,
      dow_filed int,
      street_num int,
      street_name text,
      city_id int,
      zip int,
      geox double precision,
      geoy double precision,
      beat text,
      district text,
      sector text,
      premise_id int,
      weapon_id int,
      domestic text,
      juvenile text,
      gang_related text,
      emp_bureau_id int,
      emp_division_id int,
      emp_unit_id int,
      num_officers int,
      investigation_status_id int,
      investigator_unit_id int,
      case_status_id int,
      ucr_code int,
      ucr_descr_id int,
      attempted_or_committed boolean,
      
      CONSTRAINT incident_pk PRIMARY KEY (incident_id),
      
      CONSTRAINT case_status_incident_fk
        FOREIGN KEY (case_status_id) REFERENCES case_status (case_status_id),
      CONSTRAINT bureau_incident_fk
        FOREIGN KEY (emp_bureau_id) REFERENCES bureau (bureau_id),
      CONSTRAINT division_incident_fk
        FOREIGN KEY (emp_division_id) REFERENCES division (division_id),
      CONSTRAINT unit_incident_emp_fk
        FOREIGN KEY (emp_unit_id) REFERENCES unit (unit_id),
      CONSTRAINT unit_incident_investigator_fk
        FOREIGN KEY (investigator_unit_id) REFERENCES unit (unit_id),
      CONSTRAINT investigation_status_incident_fk
        FOREIGN KEY (investigation_status_id) REFERENCES investigation_status (investigation_status_id),
      CONSTRAINT premise_incident_fk
        FOREIGN KEY (premise_id) REFERENCES premise (premise_id),
      CONSTRAINT weapon_incident_fk
        FOREIGN KEY (weapon_id) REFERENCES weapon (weapon_id),
      CONSTRAINT city_incident_fk
        FOREIGN KEY (city_id) REFERENCES city (city_id),
      CONSTRAINT ucr_descr_incident_fk
        FOREIGN KEY (ucr_descr_id) REFERENCES ucr_descr (ucr_descr_id)
    );
    """)
    
    db.query("""
    CREATE TABLE mo_item
    (
      mo_item_id int NOT NULL,
      item_descr text,
      mo_group_id int NOT NULL,
      group_descr text,
      CONSTRAINT mo_item_pk PRIMARY KEY (mo_item_id, mo_group_id)
    );
    """)
    
    db.query("""
    CREATE TABLE modus_operandi
    (
      incident_id bigint,
      mo_id bigint,
      mo_group_id int,
      mo_item_id int,
      
      CONSTRAINT mo_pkey PRIMARY KEY (mo_id),
      
      CONSTRAINT incident_modus_operandi_fk FOREIGN KEY (incident_id) REFERENCES incident (incident_id),
      CONSTRAINT mo_item_modus_operandi_fk FOREIGN KEY (mo_item_id, mo_group_id) 
        REFERENCES mo_item (mo_item_id, mo_group_id)
    );
    """)
    
    db.query("""
    CREATE TABLE call_source
    (
      call_source_id serial NOT NULL,
      descr text,
      CONSTRAINT call_source_pk PRIMARY KEY (call_source_id)
    );
    """)
    
    db.query("""
    CREATE TABLE call_unit
    (
      call_unit_id serial NOT NULL,
      descr text,
      CONSTRAINT call_unit_pk PRIMARY KEY (call_unit_id)
    );
    """)
    
    db.query("""
    CREATE TABLE close_code
    (
      close_code_id serial NOT NULL,
      descr text,
      CONSTRAINT close_code_pk PRIMARY KEY (close_code_id)
    );
    """)
    
    db.query("""
    CREATE TABLE nature
    (
      nature_id serial NOT NULL,
      descr text,
      CONSTRAINT nature_pk PRIMARY KEY (nature_id)
    );
    """)
    
    db.query("""
    CREATE TABLE call
    (
      call_id bigint NOT NULL,
      month int,
      week int,
      day_of_week int,
      hour int,
      case_id bigint,
      call_source_id int,
      primary_unit_id int,
      first_dispatched_id int,
      reporting_unit_id int,
      street_num int,
      street_name text,
      city_id int,
      zip int,
      crossroad1 text,
      crossroad2 text,
      geox double precision,
      geoy double precision,
      beat text,
      district text,
      sector text,
      business text,
      nature_id int,
      priority text,
      report_only boolean,
      cancelled boolean,
      time_received timestamp without time zone,
      time_routed timestamp without time zone,
      time_finished timestamp without time zone,
      first_unit_dispatch timestamp without time zone,
      first_unit_enroute timestamp without time zone,
      first_unit_arrive timestamp without time zone,
      first_unit_transport timestamp without time zone,
      last_unit_clear timestamp without time zone,
      time_closed timestamp without time zone,
      close_code_id int,
      close_comm text,
      
      CONSTRAINT call_pk PRIMARY KEY (call_id),
      
      CONSTRAINT call_source_call_fk
        FOREIGN KEY (call_source_id) REFERENCES call_source (call_source_id),
      CONSTRAINT call_unit_call_primary_unit_fk
        FOREIGN KEY (primary_unit_id) REFERENCES call_unit (call_unit_id),
      CONSTRAINT call_unit_call_first_dispatched_fk
        FOREIGN KEY (first_dispatched_id) REFERENCES call_unit (call_unit_id),
      CONSTRAINT call_unit_call_reporting_unit_fk
        FOREIGN KEY (reporting_unit_id) REFERENCES call_unit (call_unit_id),
      CONSTRAINT city_call_fk
        FOREIGN KEY (city_id) REFERENCES city (city_id),
      CONSTRAINT close_code_call_fk
        FOREIGN KEY (close_code_id) REFERENCES close_code (close_code_id),
      CONSTRAINT incident_call_fk
        FOREIGN KEY (case_id) REFERENCES incident (case_id),
      CONSTRAINT nature_call_fk
        FOREIGN KEY (nature_id) REFERENCES nature (nature_id)
    );
    """)
    
    db.query("""
    CREATE TABLE note_author
    (
      note_author_id serial NOT NULL,
      descr text,
      CONSTRAINT note_author_pk PRIMARY KEY (note_author_id)
    );
    """)
    
    db.query("""
    CREATE TABLE note
    (
      note_id serial NOT NULL,
      body text,
      time_recorded timestamp without time zone,
      note_author_id int,
      call_id bigint,
      CONSTRAINT note_pk PRIMARY KEY (note_id),
      
      CONSTRAINT call_note_fk FOREIGN KEY (call_id) REFERENCES call (call_id),
      CONSTRAINT note_author_note_fk FOREIGN KEY (note_author_id) REFERENCES note_author (note_author_id)
    );
    """)

    db.query("""
    CREATE TABLE transaction
    (
      transaction_id serial NOT NULL,
      descr text,
      CONSTRAINT transaction_pk PRIMARY KEY (transaction_id)
    );
    """)
    
    db.query("""
    CREATE TABLE call_log
    (
      call_log_id bigint NOT NULL,
      transaction_id int,
      time_recorded timestamp without time zone,
      call_id bigint,
      call_unit_id int,
      close_code_id int,
      
      CONSTRAINT call_log_pk PRIMARY KEY (call_log_id),
      
      CONSTRAINT call_unit_call_log_fk FOREIGN KEY (call_unit_id) REFERENCES call_unit (call_unit_id),
      CONSTRAINT call_call_log_fk FOREIGN KEY (call_id) REFERENCES call (call_id),
      CONSTRAINT close_code_call_log_fk FOREIGN KEY (close_code_id) REFERENCES close_code (close_code_id),
      CONSTRAINT transaction_call_log_fk FOREIGN KEY (transaction_id) REFERENCES transaction (transaction_id)
    );
    """)
    

    
reset_db()

#Small lookup tables v2

In [111]:
# There are a million of these, so let's make life easier and reuse all that code

# We need to save the mapping between DPD's short codes and our database ids so we can apply it to the records
# in the main tables
#
# These have the DPD's codes as keys and our internal database PKs as values
case_status_code_mapping = {}
division_code_mapping = {}
unit_code_mapping = {}
bureau_code_mapping = {}
investigation_status_code_mapping = {}
call_source_code_mapping = {}
close_code_code_mapping = {}

lookup_jobs = [
    {
        "file": "LWMAIN.CSSTATUS.csv",
        "table": "case_status",
        "mapping": {"descriptn": "descr"},
        "code_mapping": case_status_code_mapping
    },
    {
        "file": "LWMAIN.EMDIVISION.csv",
        "table": "division",
        "mapping": {"descriptn": "descr"},
        "code_mapping": division_code_mapping
    },
    {
        "file": "LWMAIN.EMSECTION.csv",
        "table": "unit",
        "mapping": {"descriptn": "descr"},
        "code_mapping": unit_code_mapping
    },
    {
        "file": "LWMAIN.EMUNIT.csv",
        "table": "bureau",
        "mapping": {"descriptn": "descr"},
        "code_mapping": bureau_code_mapping
    },
    {
        "file": "LWMAIN.INVSTSTATS.csv",
        "table": "investigation_status",
        "mapping": {"descriptn": "descr"},
        "code_mapping": investigation_status_code_mapping
    },
    {
        "file": "inmain.callsource.tsv",
        "table": "call_source",
        "mapping": {"Description": "descr"},
        "code_mapping": call_source_code_mapping
    },
    {
        "file": "inmain.closecode.tsv",
        "table": "close_code",
        "mapping": {"Description": "descr"},
        "code_mapping": close_code_code_mapping
    }
]

for job in lookup_jobs:
    print("loading %s into %s" % (job['file'], job['table']))
    
    if job['file'].endswith(".csv"):
        data = pd.read_csv("../csv_data/%s" % (job['file']))
    elif job['file'].endswith(".tsv"):
        data = pd.read_csv("../csv_data/%s" % (job['file']), sep='\t')
    
    # Record the id each row will receive.  The rows are inserted in file order into a freshly created
    # table, so we assume the serial primary key assigns 1, 2, 3, ... in that same order.
    id_ = 1    
    for (i,row) in data.iterrows():
        job['code_mapping'][row['code_agcy']] = id_
        id_ += 1

    # Keep only the desired columns
    data = data[list(job['mapping'].keys())]
            
    # Change the column names to the ones we want and insert the data
    data.rename(columns=job['mapping'], inplace=True)
    data.to_sql(job['table'], engine, index=False, if_exists='append')

loading LWMAIN.CSSTATUS.csv into case_status
loading LWMAIN.EMDIVISION.csv into division
loading LWMAIN.EMSECTION.csv into unit
loading LWMAIN.EMUNIT.csv into bureau
loading LWMAIN.INVSTSTATS.csv into investigation_status
loading inmain.callsource.tsv into call_source
loading inmain.closecode.tsv into close_code
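
The loop above predicts the database ids by counting rows. A possible alternative, sketched here with the standard library's `sqlite3` instead of Postgres and with invented sample rows, is to read the generated keys back after the insert. This assumes the descriptions are unique within the lookup table.

```python
import sqlite3

# Invented sample rows: (agency code, description)
rows = [("A", "ACTIVE"), ("C", "CLOSED"), ("I", "INACTIVE")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE case_status (case_status_id INTEGER PRIMARY KEY, descr TEXT)")
conn.executemany("INSERT INTO case_status (descr) VALUES (?)", [(d,) for _, d in rows])

# Read the generated keys back instead of assuming they follow insertion order
ids_by_descr = {
    descr: id_
    for id_, descr in conn.execute("SELECT case_status_id, descr FROM case_status")
}

# Agency code -> database id, for translating codes in the main table later
code_mapping = {code: ids_by_descr[descr] for code, descr in rows}
```

The read-back costs one extra query per lookup table but no longer depends on the table being empty or the sequence starting at 1.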


In [112]:
#These have to create "nested" tables and are a little tougher, but we can still reuse the code

# Still need to keep track of the mappings
weapon_code_mapping = {}
premise_code_mapping = {}

nested_lookup_jobs = [
    {
        "file": "LWMAIN.PREMISE.csv",
        "outer_table": "premise",
        "inner_table": "premise_group",
        "outer_cols": ["premise_group_id","descr"],
        "inner_col": "descr",
        "inner_id": "premise_group_id",
        "code_mapping": premise_code_mapping
    },
    {
        "file": "LWMAIN.WEAPON.csv",
        "outer_table": "weapon",
        "inner_table": "weapon_group",
        "outer_cols": ["weapon_group_id","descr"],
        "inner_col": "descr",
        "inner_id": "weapon_group_id",
        "code_mapping": weapon_code_mapping
    }
]

for job in nested_lookup_jobs:
    print("loading %s into %s and %s" % (job['file'], job['outer_table'], job['inner_table']))
    data = pd.read_csv("../csv_data/%s" % (job['file']))
    
    # load the group table by getting all the unique groups
    inner_data = data['descriptn_a'].drop_duplicates()
    inner_data.name = job['inner_col']
    inner_data.to_sql(job['inner_table'], engine, index=False, if_exists='append')
    
    # Learn the mapping between groups and group_ids in the database so we can insert the proper
    # group_ids with the outer tables
    groups = {}
    for row in db.query("SELECT * FROM %s" % (job['inner_table'])):
        groups[row[job['inner_col']]] = row[job['inner_id']]
       
    # Figure out what the database ids will be, so we can convert DPD's columns to the database ids in the
    # main table load
    id_ = 1
    for (i,row) in data.iterrows():
        job['code_mapping'][row['code_agcy']] = id_
        id_ += 1
    
    # Concatenate and rename the series we want
    outer_data = pd.concat([data['descriptn_a'], data['descriptn_b']], axis=1, keys=job['outer_cols'])
    
    # use the groups mapping to turn group names into ids from our database
    outer_data[job['inner_id']] = outer_data[job['inner_id']].map(lambda x: groups[x])
    
    # Store the records
    outer_data.to_sql(job['outer_table'], engine, index=False, if_exists='append')

loading LWMAIN.PREMISE.csv into premise and premise_group
loading LWMAIN.WEAPON.csv into weapon and weapon_group
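
The nested load is easier to follow on a toy example. The sketch below uses plain Python and invented rows to show the three structures the cell produces: the group (inner) table, the item (outer) table that references it, and the code mapping kept for the main table load.

```python
# Invented sample rows: (agency code, group description, item description)
rows = [
    ("01", "RESIDENCE", "HOUSE"),
    ("02", "RESIDENCE", "APARTMENT"),
    ("03", "COMMERCIAL", "BANK"),
]

# Inner (group) table: one id per distinct group, in order of first appearance
group_ids = {}
for _, group, _ in rows:
    group_ids.setdefault(group, len(group_ids) + 1)

# Outer (item) table: each item carries the id of its group
items = [
    {"id": i, "descr": item, "group_id": group_ids[group]}
    for i, (_, group, item) in enumerate(rows, start=1)
]

# Agency code -> outer-table id, for translating codes in the main table later
code_mapping = {code: i for i, (code, _, _) in enumerate(rows, start=1)}
```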


#cfs_2014_lwmain.csv

In [113]:
chunksize = 20000

ucr_descr_code_mapping = {}

city_code_mapping = {}

def safe_strip(str_):
    try:
        return str_.strip()
    except AttributeError:
        return str_
    
city = pd.DataFrame()
ucr_descr_pairs = pd.DataFrame()

print("loading lookup tables")

# We'll start out by doing a pass through the file and loading the lookup tables we need (ucr_descr, city)
for incident in pd.read_csv('../csv_data/cfs_2014_lwmain.csv', chunksize=chunksize,
                       iterator=True, encoding='ISO-8859-1', low_memory=False):
    
    #Strip extraneous white space and turn resulting blanks into NULLs
    incident = incident.applymap(safe_strip).applymap(lambda x: None if x == '' or pd.isnull(x) else x)
    
    # Turn the ucr_descrs into pairs, since it's the pairs that are unique.  Accumulate them across chunks;
    # reassigning here would keep only the pairs from the last chunk.
    ucr_descr_pairs = pd.concat([ucr_descr_pairs, incident[['arr_chrg', 'chrgdesc']]], axis=0)
    ucr_descr_pairs = ucr_descr_pairs.drop_duplicates()
    
    # Add the cities in the current chunk to the dataframe (selected as a one-column frame so the
    # column keeps its name)
    city = pd.concat([city, incident[['city']]], axis=0)
    city = city.drop_duplicates()

# switch to our column names
city.rename(columns={'city': 'descr'}, inplace=True)
ucr_descr_pairs.rename(columns={'arr_chrg': 'short_descr', 'chrgdesc': 'long_descr'}, inplace=True)

# we don't need nulls in a lookup table
city = city[~city.descr.isnull()]

#store the records
city.to_sql('city', engine, index=False, if_exists='append')
ucr_descr_pairs.to_sql('ucr_descr', engine, index=False, if_exists='append')

print("lookup tables loaded")

loading lookup tables
lookup tables loaded
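
The strip-then-null cleanup applied to each chunk can be sketched for a single row in plain Python. `clean_cell` is a hypothetical helper and the row is invented; in the notebook the same effect comes from `safe_strip` plus the `applymap` lambda.

```python
import math

def clean_cell(value):
    """Strip strings; map blanks and NaN to None; leave everything else alone."""
    if isinstance(value, str):
        value = value.strip()
        return value or None
    if isinstance(value, float) and math.isnan(value):
        return None
    return value

# Invented sample row with padding, a blank, a NaN, and a normal number
row = {"city": "  ALPHA ", "zip": "   ", "geox": float("nan"), "streetnbr": 120}
cleaned = {k: clean_cell(v) for k, v in row.items()}
```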


#OLD CODE TO BE REWRITTEN BELOW

##cfs_2014_inmain.csv

In [30]:
import re

timestamp_expr = re.compile(r"\[(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.+?)\]")

def split_notes_dict(notes,call_id):
    """
    Return a list of dicts.  Each dict represents a single note and contains the corresponding call_id,
    the timestamp, the note-taker, and the text of the note.
    """
    dicts = []
    regex_split = timestamp_expr.split(notes)[:-1]  # get rid of the last empty string created by the split
    for i in range(0,len(regex_split),3):
        text = regex_split[i].strip()
        timestamp = dt.datetime.strptime(regex_split[i+1], "%m/%d/%y %H:%M:%S")
        author = regex_split[i+2]
        dicts.append({"text": text, "timestamp": timestamp, "author": author, "call_id": call_id})
    return dicts

def split_notes(notes):
    """
    Return a list of tuples.  Each tuple represents a single note and contains the text of the note,
    the timestamp, and the note-taker, in that order.
    """
    notes = str(notes)
    tuples = []
    regex_split = timestamp_expr.split(notes)[:-1]  # get rid of the last empty string created by the split
    for i in range(0,len(regex_split),3):
        text = regex_split[i].strip()
        timestamp = dt.datetime.strptime(regex_split[i+1], "%m/%d/%y %H:%M:%S")
        author = regex_split[i+2]
        tuples.append((text, timestamp, author))
    return tuples

def safe_strip(str_):
    try:
        return str_.strip()
    except AttributeError:
        return str_
    
def clean_caseid(c):
    c = str(c).replace('nan','').replace('-','').replace(' ','')
    return None if c == '' else int(c)

start = dt.datetime.now()
# load the data in chunks so we don't use too much memory
chunksize = 20000
j = 0

# We need to map the inmain columns to the renamed columns in the call table
# if an inmain column isn't in this dict, it means we need to drop it
call_mappings = {
    "inci_id": "call_id",
    "calltime": "call_time",
    "calldow": "call_dow",
    "case_id": "case_id",
    "callsource": "call_source",
    "primeunit": "primary_unit",
    "firstdisp": "first_dispatched",
    "streetno": "street_num",
    "streetonly": "street_name",
    "citydesc": "city_desc",
    "zip": "zip",
    "crossroad1": "crossroad1",
    "crossroad2": "crossroad2",
    "geox": "geox",
    "geoy": "geoy",
    "service": "service",
    "agency": "agency",
    "statbeat": "beat",
    "district": "district",
    "ra": "sector",
    "business": "business",
    "naturecode": "nature_code",
    "nature": "nature_desc",
    "priority": "priority",
    "rptonly": "report_only",
    "cancelled": "cancelled",
    "timeroute": "time_enroute",
    "timefini": "time_finished",
    "firstdtm": "first_unit_dispatch",
    "firstenr": "first_unit_enroute",
    "firstarrv": "first_unit_arrive",
    "firsttran": "first_unit_transport",
    "lastclr": "last_unit_clear",
    "timeclose": "time_closed",
    "reptaken": "reporting_unit",
    "closecode": "close_code",
    "closecomm": "close_comm"
}

keep_columns = set(call_mappings.keys())

for call in pd.read_csv('../csv_data/cfs_2014_inmain.csv', chunksize=chunksize, iterator=True, encoding='ISO-8859-1',
                       low_memory=False):
    
    # Nice, clean iterative algorithm for separating out the notes data -- unfortunately, it's prohibitively
    # slow (~3 mins per 25k records or thereabouts)
    #for index, row in call.iterrows():
    #    note = note.append(pd.DataFrame(split_notes_dict(str(row['notes']), row['inci_id'])))
        #if call.iloc[i]['naturecode'] not in nature_set:
        #    nature_set.add(call.iloc[i]['naturecode'])
        #    nature = nature.append(pd.DataFrame({"nature_code": [call.iloc[i]['naturecode']],
        #                                "nature_desc": [call.iloc[i]['nature']]}))
   
    # Horrid ugly algorithm for separating out the notes data -- it's faster by about 10x though.
    # Pandas is really slow when iterating on rows, so we have to do all the transformations to a whole
    # series/list at a time.

    # Create a new series, which is (for each call) a list of tuples containing the text, timestamp, and
    # author of each remark on that call:
    # ex. Series(["one long string with text, timestamp, author for all remarks"]) -> 
    #     Series([[(text, timestamp, author), (text2, timestamp2, author2)]])
    call['collected_notes'] = call['notes'].apply(split_notes)
    
    # Combine the previous series with the inci_id of each row, preserving the relationship between inci_id
    # and each individual remark, then convert it to a list so we can reduce and map
    # ex. Series([[(text, timestamp, author)], [(text2, timestamp2, author2)]]) ->
    #     [[((text, timestamp, author), inci_id)], [((text2, timestamp2, author2), inci_id2)]]
    combined_notes = call['collected_notes'].combine(call['inci_id'],
                                                          lambda x,y: [(e,y) for e in x]).tolist()
    
    # Reduce the list of lists using extend; instead of a list of lists of tuples, we have one long list of
    # nested tuples
    # ex. [[((text, timestamp, author), inci_id)], [((text2, timestamp2, author2), inci_id2)]] ->
    #     [((text, timestamp, author), inci_id), ((text2, timestamp2, author2), inci_id2)]
    extended_notes = []
    for l in combined_notes:
        extended_notes.extend(l)
    
    # Flatten the tuples, so we have a list of non-nested tuples
    # ex. [((text, timestamp, author), inci_id), ((text2, timestamp2, author2), inci_id2)] ->
    #     [(text, timestamp, author, inci_id), (text2, timestamp2, author2, inci_id2)]
    extended_notes = [(n[0], n[1], n[2], call_id) for (n, call_id) in extended_notes]
    
    # Create a dataframe from the list of tuples (whew)
    note = pd.DataFrame.from_records(extended_notes, columns=['text','timestamp','author','call_id'])
    
    # drop unnecessary columns
    for c in call.columns:
        if c not in keep_columns:
            call = call.drop(c, axis=1)   
    
    # rename to the CFS Analytics column names
    call.rename(columns=call_mappings, inplace=True)
    

    
    ##### USING DPD COLUMN NAMES ABOVE #########
    ##### USING CFS ANALYTICS COLUMN NAMES BELOW ######
    
    # get rid of some weird records that break the case_id cleanup
    call = call[~(call.call_id.isin((2014055521,2014269353)))]
    note = note[~(note.call_id.isin((2014055521,2014269353)))]
    
    # clean up the case_id column
    call['case_id'] = call['case_id'].map(clean_caseid)
    
    # Perform datetime conversions
    call['call_time'] = pd.to_datetime(call['call_time'])
    call['time_enroute'] = pd.to_datetime(call['time_enroute'])
    call['time_finished'] = pd.to_datetime(call['time_finished'])
    call['first_unit_dispatch'] = pd.to_datetime(call['first_unit_dispatch'])
    call['first_unit_enroute'] = pd.to_datetime(call['first_unit_enroute'])
    call['first_unit_arrive'] = pd.to_datetime(call['first_unit_arrive'])
    call['first_unit_transport'] = pd.to_datetime(call['first_unit_transport'])
    call['last_unit_clear'] = pd.to_datetime(call['last_unit_clear'])
    call['time_closed'] = pd.to_datetime(call['time_closed'])

    # progress update
    j+=1
    print('{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j*chunksize))
    
    # get rid of excess whitespace
    call = call.applymap(safe_strip)
    note = note.applymap(safe_strip)
    
    # store in the database
    call.to_sql('call', engine, index=False, if_exists='append')
    note.to_sql('note', engine, index=False, if_exists='append')

8 seconds: completed 20000 rows
80 seconds: completed 40000 rows
150 seconds: completed 60000 rows
220 seconds: completed 80000 rows
293 seconds: completed 100000 rows
372 seconds: completed 120000 rows
449 seconds: completed 140000 rows
526 seconds: completed 160000 rows
606 seconds: completed 180000 rows
688 seconds: completed 200000 rows
769 seconds: completed 220000 rows
849 seconds: completed 240000 rows
929 seconds: completed 260000 rows
1012 seconds: completed 280000 rows
1093 seconds: completed 300000 rows
1179 seconds: completed 320000 rows
1261 seconds: completed 340000 rows
1345 seconds: completed 360000 rows
1427 seconds: completed 380000 rows
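
The flattening of notes in the cell above may be easier to see without pandas. The sketch below uses invented data and placeholder timestamps to turn a list of per-call note lists into one flat list of rows, each carrying its call id, which is the shape the `note` dataframe is built from.

```python
# Invented sample: (call_id, list of (text, timestamp, author) tuples as split_notes returns them)
notes_by_call = [
    (2014000001, [("first remark", "t1", "AUTHOR1"), ("second remark", "t2", "AUTHOR2")]),
    (2014000002, []),  # a call with no notes contributes no rows
    (2014000003, [("only remark", "t3", "AUTHOR1")]),
]

# One row per remark, with the call id appended to keep the relationship
flat = [
    (text, timestamp, author, call_id)
    for call_id, notes in notes_by_call
    for (text, timestamp, author) in notes
]
```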


##cfs_xxx2014_incilog.csv
There is one of these for each month, so we have to load them separately.

In [31]:
months = ("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec")

def safe_strip(str_):
    try:
        return str_.strip()
    except AttributeError:
        return str_

for month in months:
    start = dt.datetime.now()
    print("Starting load for month: %s" % (month))
    # load the data in chunks so we don't use too much memory
    chunksize = 20000
    j = 0

    # We need to map the incilog columns to the renamed columns in the call_log table
    # if an incilog column isn't in this dict, it means we need to drop it
    call_log_mappings = {
        "incilogid": "call_log_id",
        "transtype": "transaction_code",
        "descript": "transaction_desc",
        "timestamp": "timestamp",
        "inci_id": "call_id",
        "unitcode": "unit_code",
        "radorev": "radio_or_event",
        "unitperid": "unitper_id",
        "closecode": "close_code"
    }
    
    keep_columns = set(call_log_mappings.keys())

    for call_log in pd.read_csv('../csv_data/cfs_%s2014_incilog.csv' % (month), chunksize=chunksize, 
                           iterator=True, encoding='ISO-8859-1', low_memory=False):
        for c in call_log.columns:
            if c not in keep_columns:
                call_log = call_log.drop(c, axis=1)

        # rename to the CFS Analytics column names
        call_log.rename(columns=call_log_mappings, inplace=True)

        ##### USING DPD COLUMN NAMES ABOVE #########
        ##### USING CFS ANALYTICS COLUMN NAMES BELOW ######
            
        # Perform datetime conversions
        call_log['timestamp'] = pd.to_datetime(call_log['timestamp'])
        
        # progress update
        j+=1
        print('{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j*chunksize))

        # strip excess whitespace
        call_log = call_log.applymap(safe_strip)
        
        # store in the database
        call_log.to_sql('call_log', engine, index=False, if_exists='append')

Starting load for month: jan
1 seconds: completed 20000 rows
17 seconds: completed 40000 rows
32 seconds: completed 60000 rows
48 seconds: completed 80000 rows
64 seconds: completed 100000 rows
81 seconds: completed 120000 rows
97 seconds: completed 140000 rows
113 seconds: completed 160000 rows
129 seconds: completed 180000 rows
145 seconds: completed 200000 rows
161 seconds: completed 220000 rows
Starting load for month: feb
1 seconds: completed 20000 rows
18 seconds: completed 40000 rows
34 seconds: completed 60000 rows
50 seconds: completed 80000 rows
65 seconds: completed 100000 rows
81 seconds: completed 120000 rows
97 seconds: completed 140000 rows
113 seconds: completed 160000 rows
130 seconds: completed 180000 rows
145 seconds: completed 200000 rows
Starting load for month: mar
1 seconds: completed 20000 rows
17 seconds: completed 40000 rows
33 seconds: completed 60000 rows
50 seconds: completed 80000 rows
66 seconds: completed 100000 rows
82 seconds: completed 120000 rows
98 
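
The keep-then-rename step in the loop above can be seen in isolation. This is a minimal sketch using only the standard library rather than pandas. Only the `closecode` to `close_code` mapping comes from the notebook; the `logid` column, the sample CSV text, and the `select_and_rename` helper are hypothetical.

```python
import csv
import io

# Only "closecode" -> "close_code" appears in the notebook; "logid" is a
# hypothetical second entry so the example has more than one column.
mappings = {
    "logid": "call_log_id",
    "closecode": "close_code",
}

def select_and_rename(row, mappings):
    """Keep only the keys present in `mappings` and rename them."""
    return {new: row[old] for old, new in mappings.items() if old in row}

# Hypothetical sample data; "junk" is not in the mapping, so it is dropped.
sample = io.StringIO("logid,closecode,junk\n1,A,x\n2,B,y\n")
rows = [select_and_rename(r, mappings) for r in csv.DictReader(sample)]
print(rows)
```

Driving both the drop and the rename from one dict means a column cannot be kept without also being given a target name.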

#cfs_2014_lwmain.csv

In [34]:
def combine_date_time(str_date, str_time):
    date = dt.datetime.strptime(str_date, "%m/%d/%y")
    time = dt.datetime.strptime(str_time, "%I:%M %p")
    return dt.datetime(date.year, date.month, date.day, time.hour, time.minute)

def safe_strip(str_):
    try:
        return str_.strip()
    except AttributeError:
        return str_

start = dt.datetime.now()
# load the data in chunks so we don't use too much memory
chunksize = 20000
j = 0

# We need to map the lwmain columns to the renamed columns in the incident table
# if an lwmain column isn't in this dict, it means we need to drop it
incident_mappings = {
    "lwmainid": "incident_id",
    "inci_id": "case_id",
    "time": "time_filed",
    "streetnbr": "street_num",
    "street": "street_name",
    "city": "city",
    "zip": "zip",
    "geox": "geox",
    "geoy": "geoy",
    "tract": "beat",
    "district": "district",
    "reportarea": "sector",
    "premise": "premise_code",
    "weapon": "weapon_code",
    "domestic": "domestic",
    "juvenile": "juvenile",
    "gangrelat": "gang_related",
    "emunit": "emp_bureau_code",
    "emdivision": "emp_division_code",
    "emsection": "emp_unit_code",
    "asst_offcr": "num_officers",
    "invststats": "investigation_status_code",
    "investunit": "investigator_unit_code",
    "csstatus": "case_status_code",
    "lwchrgid": "lwchrgid",
    "chrgcnt": "charge_seq",
    "ucr_code": "ucr_code",
    "arr_chrg": "ucr_short_desc",
    "attm_comp": "attempted_or_committed"
}

keep_columns = set(incident_mappings.keys())

ucr_desc = pd.DataFrame({"ucr_short_desc": [], "ucr_long_desc": []})

for incident in pd.read_csv('../csv_data/cfs_2014_lwmain.csv', chunksize=chunksize, 
                       iterator=True, encoding='ISO-8859-1', low_memory=False):
    
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    ucr_desc = pd.concat([ucr_desc,
                          pd.concat([ incident['arr_chrg'],
                                      incident['chrgdesc'] ],
                                    axis=1, keys=['ucr_short_desc', 'ucr_long_desc'])])
    
    # Perform datetime conversions
    incident['time'] = incident['date_rept'].combine(incident['time'], combine_date_time)
    
    for c in incident.columns:
        if c not in keep_columns:
            incident = incident.drop(c, axis=1)

    # rename to the CFS Analytics column names
    incident.rename(columns=incident_mappings, inplace=True)

    ##### USING DPD COLUMN NAMES ABOVE #########
    ##### USING CFS ANALYTICS COLUMN NAMES BELOW ######
    
    # strip whitespace
    incident = incident.applymap(safe_strip)
    ucr_desc = ucr_desc.applymap(safe_strip)
    
    # convert empty strings in num_officers to nulls so we can insert as an int column
    incident['num_officers'] = incident['num_officers'].map(lambda x: None if x == '' else x)
    
    # These "primary key" values have two records and I don't want to deal with it
    incident = incident[~(incident.incident_id.isin((498659, 503578, 521324)))]
    
    # Drop duplicate ucr_descs
    ucr_desc = ucr_desc.drop_duplicates()
    
    # progress update
    j+=1
    print('{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j*chunksize))

    # store in the database
    incident.to_sql('incident', engine, index=False, if_exists='append')

ucr_desc.to_sql('ucr_desc', engine, index=False, if_exists='append')

1 seconds: completed 20000 rows
21 seconds: completed 40000 rows
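
To make the date and time handling concrete, here is a small worked example of the two format strings used by `combine_date_time`. It is a sketch, not the notebook's exact code: it uses `datetime.combine` in place of the field-by-field constructor, and the sample values are hypothetical rather than taken from the data.

```python
import datetime as dt

def combine_date_time(str_date, str_time):
    # Same format strings as the notebook: two-digit year, 12-hour clock
    date = dt.datetime.strptime(str_date, "%m/%d/%y")
    time = dt.datetime.strptime(str_time, "%I:%M %p")
    # Take the calendar date from one parse and the clock time from the other
    return dt.datetime.combine(date.date(), time.time())

# Hypothetical sample values, not taken from the data
stamp = combine_date_time("08/01/14", "03:58 PM")
midnight = combine_date_time("08/01/14", "12:05 AM")
print(stamp.isoformat())
print(midnight.isoformat())
```

Note that `%y` only accepts two-digit years, so a four-digit year in the date column would raise a `ValueError` here.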


#cfs_2014_lwmodop.csv

In [35]:
def safe_strip(str_):
    try:
        return str_.strip()
    except AttributeError:
        return str_

start = dt.datetime.now()
# load the data in chunks so we don't use too much memory
# strange unexplainable crash using the usual 20k chunk size (and 10k sometimes? and 5k sometimes? this makes no sense)
# so go with 20k (no) 10k (no) 5k (no)
# actually just put your favorite number here and hope it doesn't crash
chunksize = 2500
j = 0

# We need to map the lwmodop columns to the renamed columns in the modus_operandi table
# if a lwmodop column isn't in this dict, it means we need to drop it
modop_mappings = {
    "lwmainid": "incident_id",
    "lwmodopid": "mo_id",
    "mogroup": "mo_group_code",
    "moitem": "mo_item_code"
}

keep_columns = set(modop_mappings.keys())

mo_item = pd.DataFrame({"mo_item_code": [], "mo_item_desc": [], "mo_group_code": [], "mo_group_desc": []})

for modop in pd.read_csv('../csv_data/cfs_2014_lwmodop.csv', chunksize=chunksize, 
                       iterator=True, low_memory=False):
    
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    mo_item = pd.concat([mo_item,
                         pd.concat([ modop['moitem'],
                                     modop['itemdesc'],
                                     modop['mogroup'],
                                     modop['groupdesc'] ],
                                   axis=1, keys=['mo_item_code', 'mo_item_desc',
                                                 'mo_group_code', 'mo_group_desc'])])

    for c in modop.columns:
        if c not in keep_columns:
            modop = modop.drop(c, axis=1)

    # rename to the CFS Analytics column names
    modop.rename(columns=modop_mappings, inplace=True)

    ##### USING DPD COLUMN NAMES ABOVE #########
    ##### USING CFS ANALYTICS COLUMN NAMES BELOW ######
    
    modop = modop.applymap(safe_strip)
    mo_item = mo_item.applymap(safe_strip)
    
    # The group codes are getting a decimal place for some reason.  convert them to ints
    mo_item['mo_group_code'] = mo_item['mo_group_code'].map(lambda x: str(int(x)))
    
    # Drop duplicate mo_items
    mo_item = mo_item.drop_duplicates()
    
    # Gotta get rid of any of the incident records we had to drop due to duplicate "primary keys"
    modop = modop[~(modop.incident_id.isin((498659, 503578, 521324)))]
    
    # progress update
    j+=1
    print('{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j*chunksize))
    
    # store in the database
    modop.to_sql('modus_operandi', engine, index=False, if_exists='append')

# Fix weird exception row causing a key error
mo_item['mo_item_desc'] = mo_item['mo_item_desc'].map(lambda x: "Discharged" if x == "Discharged34" else x)
mo_item.to_sql('mo_item', engine, index=False, if_exists='append')

0 seconds: completed 2500 rows
1 seconds: completed 5000 rows
3 seconds: completed 7500 rows
4 seconds: completed 10000 rows
6 seconds: completed 12500 rows
8 seconds: completed 15000 rows
9 seconds: completed 17500 rows
11 seconds: completed 20000 rows
13 seconds: completed 22500 rows
14 seconds: completed 25000 rows
16 seconds: completed 27500 rows
17 seconds: completed 30000 rows
19 seconds: completed 32500 rows
21 seconds: completed 35000 rows
22 seconds: completed 37500 rows
24 seconds: completed 40000 rows
26 seconds: completed 42500 rows
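
A likely explanation for the decimal place on the group codes is that pandas reads an integer column containing missing values as floats, so `3` becomes `3.0`. The `str(int(x))` conversion used above fixes that, but it raises a `ValueError` if it ever meets a missing value. The following is a minimal sketch of a more defensive conversion; `code_to_str` is a hypothetical helper, not part of the notebook.

```python
import math

def code_to_str(value):
    """Render a numeric code such as 3.0 as "3"; return None for missing values."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    # float() first so that string inputs like "12.0" are also handled
    return str(int(float(value)))

print(code_to_str(3.0))
print(code_to_str("12.0"))
print(code_to_str(float("nan")))
```

It could be dropped into the existing `.map(...)` call in place of the lambda, assuming missing group codes should stay null rather than stop the load.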


#Adding foreign key constraints
We can't add some of the foreign key constraints until all of the data has been loaded, so we add them here.

In [36]:
engine.execute("""
ALTER TABLE incident
ADD CONSTRAINT incident_ucr_short_desc_fkey FOREIGN KEY (ucr_short_desc) REFERENCES ucr_desc (ucr_short_desc);
""")

engine.execute("""
ALTER TABLE modus_operandi
ADD CONSTRAINT mo_mo_item_code_fkey
FOREIGN KEY (mo_item_code, mo_group_code) REFERENCES mo_item (mo_item_code, mo_group_code);
""")

engine.execute("""
ALTER TABLE weapon
ADD CONSTRAINT weapon_weapon_desc_fk FOREIGN KEY (weapon_desc) REFERENCES weapon_group (weapon_desc);
""")

engine.execute("""
ALTER TABLE premise
ADD CONSTRAINT premise_premise_desc_fk FOREIGN KEY (premise_desc) REFERENCES premise_group (premise_desc);
""")

<sqlalchemy.engine.result.ResultProxy at 0x10b9727b8>
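
An `ADD CONSTRAINT ... FOREIGN KEY` statement fails if any child row points at a key that is missing from the parent table, so it can help to look for such orphans first. The sketch below shows the idea with an in-memory SQLite database as a stand-in for the notebook's database. The table and column names follow the first constraint above, but the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ucr_desc (ucr_short_desc TEXT PRIMARY KEY, ucr_long_desc TEXT);
CREATE TABLE incident (incident_id INTEGER PRIMARY KEY, ucr_short_desc TEXT);
-- Hypothetical rows: 'ZZZ' has no matching ucr_desc entry
INSERT INTO ucr_desc VALUES ('AAA', 'hypothetical description');
INSERT INTO incident VALUES (1, 'AAA'), (2, 'ZZZ'), (3, NULL);
""")

# Child values with no parent row; NULLs are skipped because a foreign key
# constraint does not check them
orphans = [row[0] for row in conn.execute("""
SELECT DISTINCT i.ucr_short_desc
FROM incident AS i
LEFT JOIN ucr_desc AS u ON u.ucr_short_desc = i.ucr_short_desc
WHERE i.ucr_short_desc IS NOT NULL
  AND u.ucr_short_desc IS NULL
ORDER BY i.ucr_short_desc
""")]
print(orphans)
```

An empty list means the constraint should be safe to add; anything else names the values that need fixing first.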