# Data Science Project

* Name: Author Name
* Email:


## TABLE OF CONTENTS 


- **[Introduction](#INTRODUCTION)**<br>
- **[OBTAIN](#OBTAIN)**<br>
- **[SCRUB](#SCRUB)**<br>
- **[EXPLORE](#EXPLORE)**<br>
- **[MODEL](#MODEL)**<br>
- **[iNTERPRET](#iNTERPRET)**<br>
- **[Conclusions/Recommendations](#CONCLUSIONS-&-RECOMMENDATIONS)**<br>
___

# INTRODUCTION

> Explain the point of your project and what question you are trying to answer with your modeling.



In [1]:
# Importing packages
import pandas as pd
from pandasql import sqldf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gzip
import shutil
import os
import sqlite3
import db_to_sqlite
from sqlite3 import Error
import csv
from pathlib import Path
import subprocess
import io
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning)


%matplotlib inline

In [2]:
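# Helper functions: load a csv file from the data directory into a dataframe (keeping all column names),
# run a query through the open cursor, import csv files into tables, and build or copy lists of files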

def display_csvfileDF(file_name, folder):
    df = pd.read_csv('data/'+folder+file_name, header=0, encoding='UTF-8')
    return df

def table_query(q):
    df = pd.DataFrame(cur.execute(q))
    df.columns = [x[0] for x in cur.description]
    return df

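def import_data_to_tables(db_path_name, list_of_files, replace_dir_name):
    # Import each csv file into the table named after the file, using the sqlite3 command-line tool
    db_name = Path(db_path_name).resolve()

    for entry in list_of_files:
        csv_file = Path(entry).resolve()
        # The table name is the file name without the directory prefix and the .csv extension
        table_name = entry.replace('.csv','').replace(replace_dir_name,'')
        result = subprocess.run(['sqlite3',
                                 str(db_name),
                                 '-cmd',
                                 '.mode csv',
                                 '.import --skip 1 ' + str(csv_file).replace('\\','\\\\')
                                 + ' ' + table_name],
                                capture_output=True)
        # Report problems instead of discarding the captured output silently
        if result.returncode != 0 or result.stderr:
            print(entry, result.stderr.decode(errors='replace'))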

def create_filelist(dir_name,n=2):
    # List the files in dir_name, skip the first n entries, and prefix each name with the directory
    files_incidents = os.listdir(dir_name)
    files_incidents=files_incidents[n:]
    files_incidents=[dir_name + s for s in files_incidents]
    return files_incidents

def copy_files(file_list, dir_out, dir_in):
    for entry in file_list:
        shutil.copyfile(dir_out+entry,dir_in+entry)
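
The import helper above relies on the sqlite3 command-line tool being installed. As a minimal sketch of an alternative that uses only the Python standard library, the function below appends the rows of one csv file to an existing table. The name `import_csv`, the toy `nibrs_age` schema, and the temporary file are assumptions made for illustration; the sketch also assumes the csv column order matches the table definition.

```python
import csv
import os
import sqlite3
import tempfile

def import_csv(conn, csv_path, table):
    """Append the rows of csv_path to an existing table, skipping the header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # the header only tells us how many columns to expect
        placeholders = ", ".join("?" * len(header))
        with conn:  # commit on success, roll back on error
            cursor = conn.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    return cursor.rowcount

# Toy demonstration with a hypothetical schema and two rows
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nibrs_age.csv")
    with open(path, "w", newline="", encoding="utf-8") as out:
        out.write("age_id,age_code,age_name\n1,NN,Under 24 Hours\n2,NB,1-6 Days Old\n")
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE nibrs_age (age_id INTEGER, age_code TEXT, age_name TEXT)")
    inserted = import_csv(demo, path, "nibrs_age")
    rows = demo.execute("SELECT * FROM nibrs_age").fetchall()

print(inserted, rows)
```

Because the insert runs inside `with conn:`, a malformed row raises an exception and leaves the table unchanged rather than partially loaded.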

# OBTAIN

## Data

### Data source

The data comes from the FBI Crime Data Explorer:
[NIBRS data for Colorado from 2009-2019](https://crime-data-explorer.fr.cloud.gov/pages/downloads)

A [data dictionary](data/NIBRS_DataDictionary.pdf) and a [record description](data/NIBRS_Record_Description.pdf) are available.


The main and reference tables are described in the data/README.md file.
In 2016 the agency changed the file structure and removed the sqlite create and load scripts from the zip directories.
In addition, the 'nibrs_property_desc.csv' files from 2014 and 2015 contain duplicated nibrs_property_desc_ids (the unique identifier in the nibrs_property_desc table), which complicated loading the data.
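
A quick way to inspect such duplicates before loading is sketched below with pandas. It is only an illustration: the helper name `find_duplicate_ids`, the column name `nibrs_property_desc_id`, and the toy rows are assumptions, not the actual file contents.

```python
import pandas as pd

def find_duplicate_ids(df, id_col):
    """Return every row whose value in id_col occurs more than once."""
    mask = df[id_col].duplicated(keep=False)  # keep=False marks all copies, not just the repeats
    return df[mask].sort_values(id_col, kind="stable")

# Hypothetical toy rows standing in for nibrs_property_desc.csv
sample = pd.DataFrame({
    "nibrs_property_desc_id": [1, 2, 2, 3],
    "property_id": [10, 11, 12, 13],
})
dupes = find_duplicate_ids(sample, "nibrs_property_desc_id")
print(dupes)
```

An empty result means the identifier is unique and the file can be loaded into a table that enforces a primary key on that column.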

### Using an already created sqlite database

The notebook with database creation is 

In [3]:
#I created a separate directory with only incident data files as a template for lists of data from 2009-2015

list_template_early=os.listdir('data/incidents/template_data/')
list_template_early

['agency_participation.csv',
 'cde_agencies.csv',
 'nibrs_arrestee.csv',
 'nibrs_arrestee_weapon.csv',
 'nibrs_bias_motivation.csv',
 'nibrs_criminal_act.csv',
 'nibrs_incident.csv',
 'nibrs_month.csv',
 'nibrs_offender.csv',
 'nibrs_offense.csv',
 'nibrs_property.csv',
 'nibrs_property_desc.csv',
 'nibrs_suspected_drug.csv',
 'nibrs_suspect_using.csv',
 'nibrs_victim.csv',
 'nibrs_victim_circumstances.csv',
 'nibrs_victim_injury.csv',
 'nibrs_victim_offender_rel.csv',
 'nibrs_victim_offense.csv',
 'nibrs_weapon.csv']

In [4]:
# List of incident data files from 2016-2019

list_template_late=list_template_early[2:]
list_template_late.append('agencies.csv')
list_template_late

['nibrs_arrestee.csv',
 'nibrs_arrestee_weapon.csv',
 'nibrs_bias_motivation.csv',
 'nibrs_criminal_act.csv',
 'nibrs_incident.csv',
 'nibrs_month.csv',
 'nibrs_offender.csv',
 'nibrs_offense.csv',
 'nibrs_property.csv',
 'nibrs_property_desc.csv',
 'nibrs_suspected_drug.csv',
 'nibrs_suspect_using.csv',
 'nibrs_victim.csv',
 'nibrs_victim_circumstances.csv',
 'nibrs_victim_injury.csv',
 'nibrs_victim_offender_rel.csv',
 'nibrs_victim_offense.csv',
 'nibrs_weapon.csv',
 'agencies.csv']

> I commented out the following cell to avoid overwriting changes to the directories

In [5]:
# copy_files(list_template_early, '/Users/elena/Desktop/FBI_crime_files/CO-2009/', 'data/incidents/2009/')
# copy_files(list_template_early, '/Users/elena/Desktop/FBI_crime_files/CO-2010/', 'data/incidents/2010/')
# copy_files(list_template_early, '/Users/elena/Desktop/FBI_crime_files/CO-2011/', 'data/incidents/2011/')
# copy_files(list_template_early, '/Users/elena/Desktop/FBI_crime_files/CO-2012/', 'data/incidents/2012/')
# copy_files(list_template_early, '/Users/elena/Desktop/FBI_crime_files/CO-2013/', 'data/incidents/2013/')
# copy_files(list_template_early, '/Users/elena/Desktop/FBI_crime_files/CO-2014/', 'data/incidents/2014/')
# copy_files(list_template_early, '/Users/elena/Desktop/FBI_crime_files/CO-2015/', 'data/incidents/2015/')

In [6]:
# copy_files(list_template_late, '/Users/elena/Desktop/FBI_crime_files/CO-2016/', 'data/incidents/2016/')
# copy_files(list_template_late, '/Users/elena/Desktop/FBI_crime_files/CO-2017/', 'data/incidents/2017/')
# copy_files(list_template_late, '/Users/elena/Desktop/FBI_crime_files/CO-2018/', 'data/incidents/2018/')
# copy_files(list_template_late, '/Users/elena/Desktop/FBI_crime_files/CO-2019/', 'data/incidents/2019/')

In [7]:
# Initiating a cursor
conn = sqlite3.connect('data/sqlite/db/test.db')
cur = conn.cursor()

In [8]:
cur.execute("""SELECT name FROM sqlite_master WHERE type='table'""").fetchall()

[]

In [9]:
# Creating tables in the database
with open('script_to_create_tables.sql') as sql_file:
    sql_as_string = sql_file.read()
cur.executescript(sql_as_string)


<sqlite3.Cursor at 0x24912922960>

In [10]:
cur.execute("""SELECT name FROM sqlite_master WHERE type='table'""").fetchall()

[('agencies',),
 ('agency_participation',),
 ('cde_agencies',),
 ('nibrs_activity_type',),
 ('nibrs_age',),
 ('nibrs_arrest_type',),
 ('nibrs_assignment_type',),
 ('nibrs_bias_list',),
 ('nibrs_location_type',),
 ('nibrs_offense_type',),
 ('nibrs_prop_desc_type',),
 ('nibrs_victim_type',),
 ('nibrs_circumstances',),
 ('nibrs_cleared_except',),
 ('nibrs_criminal_act',),
 ('nibrs_criminal_act_type',),
 ('nibrs_drug_measure_type',),
 ('nibrs_ethnicity',),
 ('nibrs_injury',),
 ('nibrs_justifiable_force',),
 ('nibrs_prop_loss_type',),
 ('nibrs_relationship',),
 ('nibrs_suspected_drug_type',),
 ('nibrs_using_list',),
 ('nibrs_weapon_type',),
 ('ref_race',),
 ('ref_state',),
 ('nibrs_arrestee',),
 ('nibrs_arrestee_weapon',),
 ('nibrs_bias_motivation',),
 ('nibrs_month',),
 ('nibrs_incident',),
 ('nibrs_offender',),
 ('nibrs_offense',),
 ('nibrs_property',),
 ('nibrs_property_desc',),
 ('nibrs_suspect_using',),
 ('nibrs_suspected_drug',),
 ('nibrs_victim',),
 ('nibrs_victim_circumstances',),
 

In [11]:
display_csvfileDF('nibrs_age.csv', 'Ref_tables/')


Unnamed: 0,age_id,age_code,age_name
0,1,NN,Under 24 Hours
1,2,NB,1-6 Days Old
2,3,BB,7-364 Days Old
3,4,00,Unknown
4,5,AG,Age in Years
5,6,99,Over 98 Years Old


In [12]:
cur.execute("""SELECT * FROM nibrs_age""").fetchall()

[]

In [13]:
# All reference table files are in this directory; the actual incident data files are in data/incidents, split by year
!ls -al data/Ref_tables/

total 56
drwxr-xr-x 1 elena 197121    0 Jun 30 14:49 .
drwxr-xr-x 1 elena 197121    0 Jun 30 15:02 ..
-rw-r--r-- 1 elena 197121  477 Jun 30 14:44 nibrs_activity_type.csv
-rw-r--r-- 1 elena 197121  137 Jun 30 14:44 nibrs_age.csv
-rw-r--r-- 1 elena 197121  105 Jun 30 14:44 nibrs_arrest_type.csv
-rw-r--r-- 1 elena 197121  266 Jun 30 14:44 nibrs_assignment_type.csv
-rw-r--r-- 1 elena 197121  993 Jun 30 14:44 nibrs_bias_list.csv
-rw-r--r-- 1 elena 197121  556 Jun 30 14:44 nibrs_circumstances.csv
-rw-r--r-- 1 elena 197121  217 Jun 30 14:44 nibrs_cleared_except.csv
-rw-r--r-- 1 elena 197121  442 Jun 30 14:44 nibrs_criminal_act_type.csv
-rw-r--r-- 1 elena 197121  218 Jun 30 14:44 nibrs_drug_measure_type.csv
-rw-r--r-- 1 elena 197121  134 Jun 30 14:44 nibrs_ethnicity.csv
-rw-r--r-- 1 elena 197121  194 Jun 30 14:44 nibrs_injury.csv
-rw-r--r-- 1 elena 197121  436 Jun 30 14:44 nibrs_justifiable_force.csv
-rw-r--r-- 1 elena 197121 1238 Jun 30 14:44 nibrs_location_type.csv
-rw-r--r-- 1 elena 197121 

In [14]:
# Creating a list of ref table files to import them into tables
files_ref=create_filelist('data/Ref_tables/',n=0)
files_ref

['data/Ref_tables/nibrs_activity_type.csv',
 'data/Ref_tables/nibrs_age.csv',
 'data/Ref_tables/nibrs_arrest_type.csv',
 'data/Ref_tables/nibrs_assignment_type.csv',
 'data/Ref_tables/nibrs_bias_list.csv',
 'data/Ref_tables/nibrs_circumstances.csv',
 'data/Ref_tables/nibrs_cleared_except.csv',
 'data/Ref_tables/nibrs_criminal_act_type.csv',
 'data/Ref_tables/nibrs_drug_measure_type.csv',
 'data/Ref_tables/nibrs_ethnicity.csv',
 'data/Ref_tables/nibrs_injury.csv',
 'data/Ref_tables/nibrs_justifiable_force.csv',
 'data/Ref_tables/nibrs_location_type.csv',
 'data/Ref_tables/nibrs_offense_type.csv',
 'data/Ref_tables/nibrs_prop_desc_type.csv',
 'data/Ref_tables/nibrs_prop_loss_type.csv',
 'data/Ref_tables/nibrs_relationship.csv',
 'data/Ref_tables/nibrs_suspected_drug_type.csv',
 'data/Ref_tables/nibrs_using_list.csv',
 'data/Ref_tables/nibrs_victim_type.csv',
 'data/Ref_tables/nibrs_weapon_type.csv',
 'data/Ref_tables/ref_race.csv',
 'data/Ref_tables/ref_state.csv']

In [15]:
#importing data into reference tables
import_data_to_tables('data/sqlite/db/test.db', files_ref, 'data/Ref_tables/')

In [16]:
q='SELECT * FROM nibrs_using_list'
df=table_query(q)
df

Unnamed: 0,suspect_using_id,suspect_using_code,suspect_using_name
0,1,A,Alcohol
1,2,C,Computer Equipment
2,3,D,Drugs/Narcotics
3,4,N,Not Applicable


In [17]:
# Importing incidents data from 2009-2015 to the database

for year in range(2009, 2016):
    dir_name = 'data/incidents/' + str(year) + '/'
    list_inc = create_filelist(dir_name, n=0)
    import_data_to_tables('data/sqlite/db/test.db', list_inc, dir_name)

In [18]:
q='SELECT * FROM nibrs_incident'
df=table_query(q)
len(df)

1701394
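
Loading the whole table into a dataframe just to count its rows is slow for a table of this size. A lighter check, sketched below, lets SQLite do the counting. The helper names `count_rows` and `count_by_year` and the toy in-memory table are assumptions for illustration; `count_by_year` also assumes that every `incident_date` value starts with a four-digit year.

```python
import sqlite3

def count_rows(conn, table):
    """Return the number of rows in a table without fetching the rows."""
    # Table names cannot be bound as query parameters, so validate the name first.
    known = {name for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table}")
    (n,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return n

def count_by_year(conn):
    """Return {year: incident count}, taking the year from the start of incident_date."""
    rows = conn.execute(
        "SELECT substr(incident_date, 1, 4) AS year, COUNT(*) "
        "FROM nibrs_incident GROUP BY year ORDER BY year")
    return dict(rows.fetchall())

# Toy in-memory table with hypothetical rows
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE nibrs_incident (incident_id INTEGER, incident_date TEXT)")
demo.executemany("INSERT INTO nibrs_incident VALUES (?, ?)", [
    (1, "2009-01-05 00:00:00"),
    (2, "2009-01-13 00:00:00"),
    (3, "2016-03-02 00:00:00"),
])
print(count_rows(demo, "nibrs_incident"), count_by_year(demo))
```

Comparing the per-year counts after each import shows directly whether a year was loaded, without subtracting running totals by hand.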

**All 2016-2019 files need to be cleaned up because the FBI changed the file format. The DATA_YEAR column needs to be removed, and the legacy columns from the previous years need to be added back. It is a tedious job, but it only needs to be done once, after which the cleaned files should be backed up.**

To clean the files up, the following needs to be done:<br>

1. Remove the **DATA_YEAR** column from each file; it is the first column.<br>

2. The following files need no changes beyond the **DATA_YEAR** column removal:<br>

> nibrs_arrestee_weapon.csv<br>
nibrs_bias_motivation.csv<br>
nibrs_criminal_act.csv<br>
nibrs_property_desc.csv<br>
nibrs_suspect_using.csv<br>
nibrs_suspected_drug.csv<br>
nibrs_victim_circumstances.csv<br>
nibrs_victim_injury.csv<br>
nibrs_victim_offender_rel.csv<br>
nibrs_victim_offense.csv<br>
nibrs_weapon.csv<br>

3. In the **nibrs_arrestee.csv** file:<br><br>
    a. between **ARRESTEE_SEQ_NUM** and **ARREST_DATE**, add an **arrest_num** column<br>
    b. between **CLEARANCE_IND** and **AGE_RANGE_LOW_NUM**, add an **ff_line_number** column<br><br>

4. In the **nibrs_incident.csv** file:<br><br>
    a. between **NIBRS_MONTH_ID** and **CARGO_THEFT_FLAG**, add an **incident_number** column<br>
    b. between **DATA_HOME** and **ORIG_FORMAT**, add a **ddocname** column<br>
    c. between **ORIG_FORMAT** and **DID**, add an **ff_line_number** column<br><br>

5. In the **nibrs_month.csv** file:<br><br>
    a. between **REPORT_DATE** and **UPDATE_FLAG**, add a **prepared_date** column<br>
    b. between **ORIG_FORMAT** and **DATA_HOME**, add an **ff_line_number** column<br>
    c. remove the **MONTH_PUB_STATUS** column<br><br>

6. In the **nibrs_offender.csv** file:<br><br>
    a. between **ETHNICITY_ID** and **AGE_RANGE_LOW_NUM**, add an **ff_line_number** column<br><br>

7. In the **nibrs_offense.csv** file:<br><br>
    a. add **ff_line_number** as the last column<br><br>

8. In the **nibrs_property.csv** file:<br><br>
    a. add **ff_line_number** as the last column<br><br>

9. In the **nibrs_victim.csv** file:<br><br>
    a. between **RESIDENT_STATUS_CODE** and **AGE_RANGE_LOW_NUM**, add two columns, **agency_data_year** and **ff_line_number** (in that order)
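
The column changes listed above could be scripted rather than done by hand. The following is a minimal sketch with pandas, not the procedure actually used for this project: the helper names `add_column` and `clean_late_file` are hypothetical, the toy csv (including its `OFFENDER_ID` column) is made up, and filling the added legacy columns with empty strings is an assumption.

```python
import io
import pandas as pd

def add_column(df, after_col, new_col):
    """Add an empty column new_col right after after_col (or as the last column if None)."""
    if after_col is None:
        position = len(df.columns)
    else:
        position = df.columns.get_loc(after_col) + 1
    df.insert(position, new_col, "")  # modifies df in place
    return df

def clean_late_file(df, additions=()):
    """Drop DATA_YEAR, then add each (after_col, new_col) pair in the given order."""
    cleaned = df.drop(columns=["DATA_YEAR"])
    for after_col, new_col in additions:
        add_column(cleaned, after_col, new_col)
    return cleaned

# Toy stand-in for a 2016+ nibrs_offender.csv file (step 6 in the list)
raw = io.StringIO("DATA_YEAR,OFFENDER_ID,ETHNICITY_ID,AGE_RANGE_LOW_NUM\n2016,1,2,30\n")
offender = pd.read_csv(raw)
cleaned = clean_late_file(offender, [("ETHNICITY_ID", "ff_line_number")])
print(list(cleaned.columns))
```

For a file that needs two adjacent columns, such as nibrs_victim.csv, the pairs can be chained: `[("RESIDENT_STATUS_CODE", "agency_data_year"), ("agency_data_year", "ff_line_number")]`. Passing `None` as the first element of a pair appends the column at the end.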
    


In [19]:
# Importing cleaned-up files to the tables and checking numbers along the way

list_inc_2016=create_filelist('data/incidents/2016/', n=0)
import_data_to_tables('data/sqlite/db/test.db', list_inc_2016, 'data/incidents/2016/')

In [20]:
q='SELECT * FROM nibrs_incident'
df=table_query(q)
len(df)

1983733

In [21]:
list_inc_2017=create_filelist('data/incidents/2017/', n=0)
import_data_to_tables('data/sqlite/db/test.db', list_inc_2017, 'data/incidents/2017/')

In [22]:
q='SELECT * FROM nibrs_incident'
df=table_query(q)
len(df)

2269247

In [23]:
list_inc_2018=create_filelist('data/incidents/2018/', n=0)
import_data_to_tables('data/sqlite/db/test.db', list_inc_2018, 'data/incidents/2018/')

In [24]:
q='SELECT * FROM nibrs_incident'
df=table_query(q)
len(df)

2556043

In [25]:
list_inc_2019=create_filelist('data/incidents/2019/', n=0)
import_data_to_tables('data/sqlite/db/test.db', list_inc_2019, 'data/incidents/2019/')

In [None]:
q='SELECT * FROM nibrs_incident'
df=table_query(q)

In [31]:
df.head()

Unnamed: 0,agency_id,incident_id,nibrs_month_id,incident_number,cargo_theft_flag,submission_date,incident_date,report_date_flag,incident_hour,cleared_except_id,cleared_except_date,incident_status,data_home,ddocname,orig_format,ff_line_number,did
0,1971,51264520,4814762,9000019,,,2009-01-05 00:00:00,,22.0,6,,0,C,2009_01_CO0320000_09000019_INC_NIBRS,,,
1,1971,51264521,4814762,9000053,,,2009-01-13 00:00:00,,,6,,0,C,2009_01_CO0320000_09000053_INC_NIBRS,,,
2,1971,51264523,4814762,9000082,,,2009-01-17 00:00:00,,19.0,6,,0,C,2009_01_CO0320000_09000082_INC_NIBRS,,,
3,1971,51264524,4814762,9000092,,,2009-01-20 00:00:00,R,,6,,0,C,2009_01_CO0320000_09000092_INC_NIBRS,,,
4,1971,51264525,4814762,9000097,,,2009-01-21 00:00:00,,,6,,0,C,2009_01_CO0320000_09000097_INC_NIBRS,,,


In [28]:
#cur.close()

#conn.close()

#!rm data/sqlite/db/test.db

# SCRUB

# EXPLORE

# MODEL

# iNTERPRET

# CONCLUSIONS & RECOMMENDATIONS

> Summarize your conclusions and bullet-point your list of recommendations, which are based on your modeling results.

# TO DO/FUTURE WORK

- 