![headline image](images/background5.png)

<br><br><center><b><font size="6">Modeling and Forecasting Crime Rate in Colorado </font></b></center>

<br><br><span style="color:black; font-size:1.5em">**Data Science Capstone Project, part 2 (pre-processing DataFrames and EDA)**</span><br>
* Student name: <b>Elena Kazakova</b>
* Student pace: <b>Full-time</b>
* Cohort: <b>DS02222021</b>
* Scheduled project review date: <span style="color:red"><b>07/26/2021</b></span>
* Instructor name: <b>James Irving</b>
* Application url: <span style="color:red"><b>TBD</b></span>


<br><br><left><b><font size="5">TABLE OF CONTENTS </font></b></left><br>


- **[INTRODUCTION](#INTRODUCTION)**<br>
- **[OBTAIN](#OBTAIN)**<br>
- **[SCRUB](#SCRUB)**<br>
- **[EXPLORE](#EXPLORE)**<br>


# INTRODUCTION

> Explain the point of your project and what question you are trying to answer with your modeling.



# OBTAIN

**The imports below use '%reload_ext autoreload', so this notebook can be re-run without restarting the kernel.**

## Imports

In [None]:
# Importing packages
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import statsmodels
import statsmodels.tsa.api as tsa
import plotly.express as px
import plotly.io as pio
import plotly
import math
from math import sqrt
import holidays
import pmdarima as pm

from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error

import pickle
#import shutil
import os
import json

# from pathlib import Path
# import subprocess
# import io

import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning)

from functions_all import *

%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
CO_zip_json=json.load(open('data/co_zip.min.json', 'r'))
CO_county_json=json.load(open('data/CO_counties_geo.json', 'r'))

## Data

### Data source and data description

Data is from FBI Crime Data Explorer
[NIBRS data for Colorado from 2009-2019](https://crime-data-explorer.fr.cloud.gov/pages/downloads)

A [data dictionary](data/NIBRS_DataDictionary.pdf) and a [record description](data/NIBRS_Record_Description.pdf) are available.


The description of the main and reference tables is in the data/README.md file.
The agency implemented some changes to the file structure in 2016 and removed the sqlite create and load scripts from the zip directories.
Another fact worth mentioning is that files 'nibrs_property_desc.csv' from 2014 and 2015 have duplicated nibrs_property_desc_ids (unique identifier in the nibrs_property_desc table) which complicated the loading of the data.

The rest of the original data description is in the [notebook](capstone_project_part1.ipynb) with the first part of data pre-processing.

### Using an already created sqlite database

The notebook with database creation is [here](capstone_project_part0.ipynb). The referenced database is in ***data/sqlite/db/production1 db***. It takes 2.5 minutes to run the database creation script.

# SCRUB

## Part I, pre-processing the data in SQL database

<br><br><span style="color:black; font-size:1.2em">The first part of the scrubbing process (working with the sqlite3 database, production1) is in [this notebook](capstone_prj_scrub_part1.ipynb). It takes about 12 minutes to run the code in the part I notebook. The following code uses the dataframes created in part I.</span>

In part I the following dataframes have been created and saved in the pickle files:<br>

    1. df_incident: data/pickled_dataframes/incident.pickle; main incident DF with date/time of an incident
    2. df_offense: data/pickled_dataframes/offense.pickle; main offense DF with offense names and categories
    3. df_offender: data/pickled_dataframes/offender.pickle; main offender DF with demographic info
    4. df_victim: data/pickled_dataframes/victim.pickle; main victim DF with demographic info
    5. df_weapon: data/pickled_dataframes/weapon.pickle; main weapon DF with a weapon category used in an offense
    6. df_bias: data/pickled_dataframes/bias.pickle; main bias DF with offense bias motivation
    7. df_rel: data/pickled_dataframes/relationship.pickle; main victim-offender relationship DF with relationship category
    

## Part II, scrubbing the data in DataFrames

### Using pickle files to create dataframes

In [None]:
with open('data/pickled_dataframes/incident.pickle', 'rb') as f:
    df_incident=pickle.load(f)
df_incident.head()

In [None]:
len(df_incident)

In [None]:
with open('data/pickled_dataframes/offense.pickle', 'rb') as f:
    df_offense=pickle.load(f)
df_offense.head()

In [None]:
len(df_offense)

In [None]:
with open('data/pickled_dataframes/offender.pickle', 'rb') as f:
    df_offender=pickle.load(f)
df_offender.head()

In [None]:
len(df_offender)

In [None]:
with open('data/pickled_dataframes/victim.pickle', 'rb') as f:
    df_victim=pickle.load(f)
df_victim.head()

In [None]:
len(df_victim)

In [None]:
with open('data/pickled_dataframes/weapon.pickle', 'rb') as f:
    df_weapon=pickle.load(f)
df_weapon.head()

In [None]:
len(df_weapon)

In [None]:
with open('data/pickled_dataframes/bias.pickle', 'rb') as f:
    df_bias=pickle.load(f)
df_bias.head()

In [None]:
len(df_bias)

In [None]:
with open('data/pickled_dataframes/relationship.pickle', 'rb') as f:
    df_rel=pickle.load(f)
df_rel.head()

In [None]:
len(df_rel)
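
The seven loading cells above all follow the same pattern. As a minimal sketch of how that pattern could be wrapped in a loop (the helper name `load_pickles` is hypothetical and not part of this project; the demonstration uses throwaway files rather than the real data):

```python
import pickle
import tempfile
from pathlib import Path


def load_pickles(directory, names):
    """Load <directory>/<name>.pickle for each name and return a dict of objects."""
    loaded = {}
    for name in names:
        with open(Path(directory) / f"{name}.pickle", "rb") as f:
            loaded[name] = pickle.load(f)
    return loaded


# Self-contained demonstration with temporary files instead of the real pickles.
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in {"incident": [1, 2, 3], "offense": [4, 5]}.items():
        with open(Path(tmp) / f"{name}.pickle", "wb") as f:
            pickle.dump(payload, f)
    demo = load_pickles(tmp, ["incident", "offense"])

print({name: len(obj) for name, obj in demo.items()})
```

With the real files, the call would presumably look like `load_pickles('data/pickled_dataframes', ['incident', 'offense', ...])`, returning the DataFrames keyed by file name.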

<br><br><span style="font-size:1.2em;">The next step is scrubbing the dataframes</span><br><br>



### Checking for duplicates, missing values and other abnormalities, <u>incident table</u>

In [None]:
df_incident.info()

#### Converting incident_date column to a datetime type

In [None]:
df_incident.head()

In [None]:
df_incident['timestamp']=pd.to_datetime(df_incident.incident_date)
df_incident.info()

In [None]:
df_incident.sort_values('timestamp', ascending=True)

#### Checking for duplicates and dropping them

In [None]:
df=df_incident[df_incident.duplicated(subset=['incident_id'],keep=False)].sort_values(by=['incident_id','timestamp'])
df

<br><br><span style="font-size:1.2em;">**There are 548 duplicate incident_id values; they appear to come from different dates, counties, and zip codes. Only the first occurrence of each duplicate will be kept in the set. The duplicate incident_ids are most probably the result of human error when the system was switched to another format in 2016.**</span><br><br>


In [None]:
# Keeping the first row for each duplicated incident_id and dropping the later rows with 2016 timestamps
# (because their indices are higher). The 'incident_date' column is removed in the next cell.

df_incident=df_incident.drop_duplicates(subset=['incident_id'],keep='first')
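
The `keep` argument of `drop_duplicates` decides which row of each duplicated incident_id survives. A small sketch on made-up data (the ids and dates are illustrative only) shows the difference between `keep='first'` and `keep='last'`:

```python
import pandas as pd

# Toy data: incident_id 101 appears twice, once in 2015 and once in 2016.
toy = pd.DataFrame({
    "incident_id": [101, 102, 101, 103],
    "timestamp": pd.to_datetime(["2015-03-01", "2015-04-10", "2016-07-04", "2016-08-15"]),
})

kept_first = toy.drop_duplicates(subset=["incident_id"], keep="first")
kept_last = toy.drop_duplicates(subset=["incident_id"], keep="last")

# Year of the surviving row for the duplicated id under each policy.
first_year = kept_first.loc[kept_first.incident_id == 101, "timestamp"].dt.year.item()
last_year = kept_last.loc[kept_last.incident_id == 101, "timestamp"].dt.year.item()
print(first_year, last_year)
```

Because "first" and "last" refer to row order, the outcome depends on how the DataFrame is ordered at the moment of the call.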

In [None]:
df_incident=df_incident.drop(columns=['incident_date'])
df_incident.head()

#### Checking for empty strings/null values and updating the rows with new values

In [None]:
# Checking for empty strings and null values
empty_string_count(df_incident)

<br><br><span style="font-size:1.2em;"> There are no NaN values, but '' (empty string) values are present in the primary_county and icpsr_zip fields</span><br><br>
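
The helper `empty_string_count` is imported from `functions_all` and its code is not shown in this notebook. A minimal sketch of what such a helper might do, assuming it reports per-column counts of NaN and '' values (the function name `count_missing_and_empty` and its output layout are hypothetical, and the data is made up):

```python
import numpy as np
import pandas as pd


def count_missing_and_empty(df):
    """Return a per-column table of NaN counts and empty-string counts."""
    return pd.DataFrame({
        "nan_count": df.isna().sum(),
        "empty_string_count": (df == "").sum(),
    })


# Toy data with both kinds of missing values.
toy = pd.DataFrame({
    "primary_county": ["Denver", "", "Boulder"],
    "icpsr_zip": ["80202", "", ""],
    "agency_id": [1, 2, np.nan],
})
summary = count_missing_and_empty(toy)
print(summary)
```

Counting the two cases separately matters here because `isna()` does not treat '' as missing.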



In [None]:
df=df_incident.loc[df_incident['primary_county']=='']
df.icpsr_zip.unique()

<br><br><span style="font-size:1.2em;"> All missing primary_county values are associated with zip code 80215, which belongs to Jefferson County, so I am filling primary_county in these records with the 'Jefferson' string.</span><br><br>



In [None]:
df_incident.loc[df_incident.primary_county == '', 'primary_county'] = 'Jefferson'

In [None]:
df=df_incident.loc[df_incident['icpsr_zip']=='']
df.agency_id.unique()

<br><br><span style="font-size:1.2em;">**The missing zip codes belong to the following agencies:**</span>
1. agency_id=1982: Fort Lewis College, located in 81301 zip code
2. agency_id=23131: South Metro Drug Task Force, located in 80160 zip code
3. agency_id=25314: Gypsum Police Department, located in 81637 zip code

<br><br><span style="font-size:1.2em;">**The values above will be used to fill in icpsr_zip column values in place of '' values**</span><br><br>



In [None]:
df_incident.loc[((df_incident.icpsr_zip == '')&(df_incident.agency_id==1982)), 'icpsr_zip'] = '81301'

df_incident.loc[((df_incident.icpsr_zip == '')&(df_incident.agency_id==23131)), 'icpsr_zip'] = '80160'

df_incident.loc[((df_incident.icpsr_zip == '')&(df_incident.agency_id==25314)), 'icpsr_zip'] = '81637'
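
The three assignments above can also be written as one dictionary lookup. A minimal sketch on a toy DataFrame, using the agency-to-zip pairs listed earlier (the row with agency_id 555 is made up to show that filled-in rows are left alone):

```python
import pandas as pd

# agency_id -> zip code, as listed in the text above.
agency_zip = {1982: "81301", 23131: "80160", 25314: "81637"}

toy = pd.DataFrame({
    "agency_id": [1982, 23131, 25314, 555],
    "icpsr_zip": ["", "", "", "80202"],
})

# Only rows with an empty zip are touched; the value is looked up by agency_id.
missing = toy["icpsr_zip"] == ""
toy.loc[missing, "icpsr_zip"] = toy.loc[missing, "agency_id"].map(agency_zip)
print(toy)
```

One caveat of this form: an agency with an empty zip that is absent from the dictionary would receive NaN rather than keep its empty string.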

In [None]:
empty_string_count(df_incident)

### Checking for duplicates, missing values and other abnormalities, <u>offense table</u>

In [None]:
df_offense.info()

In [None]:
df_offense.head()

#### Checking for duplicates

In [None]:
df=df_offense[df_offense.duplicated(subset=['offense_id'],keep=False)].sort_values(by='offense_id')
df

<br><br><span style="font-size:1.2em;"> There are no duplicate offense_ids</span><br><br>



#### Checking for empty strings/null values and updating the rows with new values

In [None]:

empty_string_count(df_offense)

<br><br><span style="font-size:1.2em;"> There are no rows with empty strings or NaN values</span><br><br>



###  Checking for duplicates, missing values and other abnormalities, <u>victim table</u>

In [None]:
df_victim.info()

In [None]:
df_victim.head()

#### Checking for duplicates

<br><br><span style="font-size:1em;">**The same person can be a victim in several incidents therefore we are only checking for duplicates with victim_ids AND incident_ids**</span><br><br>



In [None]:
df=df_victim[df_victim.duplicated(subset=['victim_id','incident_id'],keep=False)].sort_values(by='victim_id')
df

<br><br><span style="font-size:1.2em;">No duplicates found</span><br><br>



#### Checking for empty strings/null values

In [None]:
empty_string_count(df_victim)

#### Abnormal values, victim table

##### race, NaN values

In [None]:
df=df_victim[df_victim.race.isnull()]
df.victim_type.unique()

<br><br><span style="font-size:1.2em;">The NaN values in the race column for victims of types **'Society/Public', 'Business', 'Government', 'Other', 'Unknown', 'Financial Institution', and 'Religious Organization'** will be replaced with the **'NA'** value. Because these victim types are the only ones with NULL race records, all NULL race values will be replaced with 'NA'.</span><br><br>

In [None]:
df_victim.loc[df_victim.race.isnull(), 'race'] = 'NA'

##### ethnicity, NaN values

In [None]:
df=df_victim[df_victim.ethnicity.isnull()]
df.victim_type.unique()

In [None]:
df=df_victim[((df_victim.ethnicity.isnull()) & (df_victim.victim_type.isin(['Law Enforcement Officer', 'Individual'])))]
print('Number of records with NaN in ethnicity and Individual or \
Law Enforcement victim type: {}'.format(len(df)))
df.head()

<br><br><span style="font-size:1.2em;">1. The NaN values in the ethnicity column for victims of types **'Society/Public', 'Business', 'Government', 'Other', 'Unknown', 'Financial Institution', and 'Religious Organization'** will be replaced with the **'NA'** value<br><br>
2. The NaN values in the ethnicity column for victims of types **'Law Enforcement Officer', 'Individual'** will be replaced with the **'Unknown'** value</span><br><br>



In [None]:
df_victim.loc[(df_victim.ethnicity.isnull()
              &df_victim.victim_type.isin(['Society/Public','Business', 'Government','Other','Unknown',
                                            'Financial Institution','Religious Organization'])), 'ethnicity'] = 'NA'

df_victim.loc[(df_victim.ethnicity.isnull()
              &df_victim.victim_type.isin(['Law Enforcement Officer', 'Individual'])), 'ethnicity'] = 'Unknown'
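
The two-rule fill above can be sketched in a single pass with `fillna`. This is an illustration on toy data, and it makes one simplifying assumption that the notebook does not: every victim type other than 'Law Enforcement Officer' and 'Individual' is treated as a non-person type. The ethnicity label in the last row is a made-up example value.

```python
import numpy as np
import pandas as pd

PERSON_TYPES = ["Law Enforcement Officer", "Individual"]

toy = pd.DataFrame({
    "victim_type": ["Individual", "Business", "Law Enforcement Officer", "Individual"],
    "ethnicity": [np.nan, np.nan, np.nan, "Not Hispanic or Latino"],
})

# Choose the replacement per row, then fill only the missing entries.
is_person = toy["victim_type"].isin(PERSON_TYPES)
fill_value = pd.Series(np.where(is_person, "Unknown", "NA"), index=toy.index)
toy["ethnicity"] = toy["ethnicity"].fillna(fill_value)
print(toy)
```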

##### age_group, NaN values

In [None]:
df=df_victim[df_victim.age_group.isnull()]
df.victim_type.unique()

<br><br><span style="font-size:1.2em;">The NaN values in the age_group column for victims of types **'Society/Public', 'Business', 'Government', 'Other', 'Unknown', 'Financial Institution', and 'Religious Organization'** will be replaced with the **'NA'** value. Because these victim types are the only ones with NULL age_group records, all NULL age_group values will be replaced with 'NA'.</span><br><br>



In [None]:
df_victim.loc[df_victim.age_group.isnull(), 'age_group'] = 'NA'

##### age_num, empty string values

In [None]:
df=df_victim[df_victim.age_num=='']
print('Number of records with empty string in age_num: {}'.format(len(df)))
df.victim_type.unique()


In [None]:
df=df_victim[((df_victim.age_num=='') & (df_victim.victim_type.isin(['Law Enforcement Officer', 'Individual'])))]
print('Number of records with empty string in age_num and Individual or Law Enforcement victim type: {}'.format(len(df)))

<br><br><span style="font-size:1.2em;">1. The empty string values in the age_num column for victims of types **'Society/Public', 'Business', 'Government', 'Other', 'Unknown', 'Financial Institution', and 'Religious Organization'** will be replaced with 999.<br>
2. The empty string values in the age_num column for victims of types **'Law Enforcement Officer', 'Individual'** AND age_group equal to 'Unknown' will be replaced with 999.<br>
3. The empty string values in the age_num column for victims of types **'Law Enforcement Officer', 'Individual'** AND age_group in ('7-364 Days Old', 'Under 24 Hours', '1-6 Days Old') will be replaced with 0.<br>
4. The empty string values in the age_num column for victims of types **'Law Enforcement Officer', 'Individual'** AND age_group 'Over 98 Years Old' will be replaced with 99.</span>

In [None]:
df_victim.loc[((df_victim.age_num=='')
              &df_victim.victim_type.isin(['Society/Public','Business', 'Government','Other','Unknown',
                                            'Financial Institution','Religious Organization'])), 'age_num'] = '999'
df_victim.loc[((df_victim.age_num=='')
              &(df_victim.victim_type.isin(['Law Enforcement Officer', 'Individual']))
              &(df_victim.age_group.isin(['7-364 Days Old','Under 24 Hours','1-6 Days Old']))), 'age_num'] = '0'

df_victim.loc[((df_victim.age_num=='')
              &(df_victim.victim_type.isin(['Law Enforcement Officer', 'Individual']))
              &(df_victim.age_group=='Over 98 Years Old')), 'age_num'] = '99'

df_victim.loc[((df_victim.age_num=='')
              &(df_victim.victim_type.isin(['Law Enforcement Officer', 'Individual']))
              &(df_victim.age_group=='Unknown')), 'age_num'] = '999'
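
The four rules can also be read as one decision function applied per record. The sketch below restates them on toy data; the function name `fill_age` is hypothetical, and the 'Age in Years' label in the last row is a made-up placeholder. A row-wise function like this is easier to read but slower than the boolean masks used above on a large table.

```python
import pandas as pd

NON_PERSON_TYPES = {"Society/Public", "Business", "Government", "Other", "Unknown",
                    "Financial Institution", "Religious Organization"}
PERSON_TYPES = {"Law Enforcement Officer", "Individual"}
INFANT_GROUPS = {"Under 24 Hours", "1-6 Days Old", "7-364 Days Old"}


def fill_age(age_num, victim_type, age_group):
    """Apply the four rules to one record; return the value unchanged if no rule applies."""
    if age_num != "":
        return age_num
    if victim_type in NON_PERSON_TYPES:
        return "999"
    if victim_type in PERSON_TYPES:
        if age_group == "Unknown":
            return "999"
        if age_group in INFANT_GROUPS:
            return "0"
        if age_group == "Over 98 Years Old":
            return "99"
    return age_num


toy = pd.DataFrame({
    "age_num": ["", "", "", "", "34"],
    "victim_type": ["Business", "Individual", "Individual", "Individual", "Individual"],
    "age_group": ["NA", "1-6 Days Old", "Over 98 Years Old", "Unknown", "Age in Years"],
})
toy["age_num"] = [fill_age(a, v, g)
                  for a, v, g in zip(toy["age_num"], toy["victim_type"], toy["age_group"])]
print(toy)
```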

##### sex_code, empty string values

In [None]:
df=df_victim[df_victim.sex_code=='']
print('Number of records with empty string in sex_code: {}'.format(len(df)))
df.victim_type.unique()

<br><br><span style="font-size:1.2em;">The empty string values in the sex_code column for victims of types **'Society/Public', 'Business', 'Government', 'Other', 'Unknown', 'Financial Institution', and 'Religious Organization'** will be replaced with the **'NA'** value. Because these victim types are the only ones with empty string sex_code records, all empty string sex_code values will be replaced with **'NA'**.</span><br>



In [None]:
df_victim.loc[df_victim.sex_code=='', 'sex_code'] = 'NA'

##### resident_status_code, empty string values

In [None]:
df=df_victim[df_victim.resident_status_code=='']
print('Number of records with empty string in resident_status_code: {}'.format(len(df)))
df.victim_type.unique()

In [None]:
df=df_victim[((df_victim.resident_status_code=='') & (df_victim.victim_type.isin(['Law Enforcement Officer', 'Individual'])))]
print('Number of records with empty string in resident_status_code and Individual or \
Law Enforcement victim type: {}'.format(len(df)))

<br><br><span style="font-size:1.2em;">1. The empty string values in the resident_status_code column for victims of types **'Society/Public', 'Business', 'Government', 'Other', 'Unknown', 'Financial Institution', and 'Religious Organization'** will be replaced with the **'NA'** value<br><br>
2. The empty string values in the resident_status_code column for victims of types **'Law Enforcement Officer', 'Individual'** will be replaced with the **'Unknown'** value</span><br><br>



In [None]:
df_victim.loc[((df_victim.resident_status_code=='')
              &df_victim.victim_type.isin(['Society/Public','Business', 'Government','Other',
                                           'Unknown','Financial Institution',
                                           'Religious Organization'])), 'resident_status_code'] = 'NA'

df_victim.loc[((df_victim.resident_status_code=='')
              &(df_victim.victim_type.isin(['Law Enforcement Officer',
                                            'Individual']))), 'resident_status_code'] = 'Unknown'

##### Renaming the columns

In [None]:
df_victim=df_victim.rename(columns={'age_num': 'victim_age', 'sex_code': 'victim_sex',
                          'resident_status_code': 'victim_resident_status','race': 'victim_race',
                         'age_group':'victim_age_group','ethnicity':'victim_ethnicity'})

In [None]:
empty_string_count(df_victim)

###  Checking for duplicates, missing values and other abnormalities, <u>offender table</u>

In [None]:
df_offender.info()

In [None]:
df_offender.head()

#### Checking for duplicates

**The same person can be an offender in several incidents therefore we are only checking for duplicates with offender_ids AND incident_ids**

In [None]:
df=df_offender[df_offender.duplicated(subset=['offender_id', 'incident_id'],keep=False)].sort_values(by='offender_id')
df

<br><br><span style="font-size:1.2em;"> No duplicates found</span><br><br>



#### Checking for empty strings/null values

In [None]:
empty_string_count(df_offender)

#### Abnormal values, offender table

##### ethnicity, NaN values

In [None]:
print('Number of records with NaN values in ethnicity: {}'.format(df_offender['ethnicity'].isnull().sum()))
df_offender['ethnicity'].value_counts()

<br><br><span style="font-size:1.2em;">The NaN value in the **ethnicity** column of offender table will be replaced with **'Unknown'** value</span><br>



In [None]:
df_offender.loc[df_offender.ethnicity.isnull(), 'ethnicity'] = 'Unknown'

##### race, NaN values

In [None]:
print('Number of records with NaN values in race: {}'.format(df_offender['race'].isnull().sum()))
df_offender['race'].value_counts()

<br><br><span style="font-size:1.2em;">The NaN value in the **race** column of offender table will be replaced with **Unknown** value</span><br><br>



In [None]:
df_offender.loc[df_offender.race.isnull(), 'race'] = 'Unknown'

##### age_group, NaN values

In [None]:
print('Number of records with NaN values in age_group: {}'.format(df_offender['age_group'].isnull().sum()))
df_offender['age_group'].value_counts()

In [None]:
df_offender.loc[df_offender['age_group'].isnull()]

<br><br><span style="font-size:1.2em;">The NaN values in the **age_group** column of the offender table will be replaced with the **'Unknown'** value. Spot-checking the records did not yield any insights; these offenders were simply never identified.</span><br><br>



In [None]:
df_offender.loc[df_offender.age_group.isnull(), 'age_group'] = 'Unknown'

##### age_num, empty string values

In [None]:
df=df_offender[df_offender.age_num=='']
print('Number of records with empty string in age_num: {}'.format(len(df)))
print('Number of records with NaN values in age_group: {}'.format(df['age_group'].isnull().sum()))
df['age_group'].value_counts()

<br><br><span style="font-size:1.2em;">1. Empty strings in the **age_num** column of the offender table where age_group equals **'Over 98 Years Old'** will be replaced with **99**<br>
2. Empty strings in the **age_num** column of the offender table where age_group equals **'Unknown'** will be replaced with **999**</span><br>

In [None]:
df_offender.loc[((df_offender.age_num=='')&(df_offender.age_group=='Over 98 Years Old')), 'age_num'] = '99'

df_offender.loc[((df_offender.age_num=='')
                 &(df_offender.age_group=='Unknown')), 'age_num'] = '999'

##### sex_code, empty string values

In [None]:
df_offender['sex_code'].value_counts()

<br><br><span style="font-size:1.2em;">The empty string value in the **sex_code** column of offender table will be replaced with **'Unknown'** value</span><br>

In [None]:
df_offender.loc[df_offender.sex_code=='', 'sex_code'] = 'Unknown'

##### Renaming the columns

In [None]:
df_offender=df_offender.rename(columns={'age_num': 'offender_age', 'sex_code': 'offender_sex',
                                        'race': 'offender_race', 'age_group':'offender_age_group',
                                        'ethnicity':'offender_ethnicity'})

In [None]:
empty_string_count(df_offender)

### Checking for duplicates, missing values and other abnormalities, <u>weapon table</u>

In [None]:
df_weapon.info()

In [None]:
empty_string_count(df_weapon)

In [None]:
# Checking for duplicates in offense_id column
df=df_weapon[df_weapon.duplicated(subset=['offense_id'],keep=False)].sort_values(by='offense_id')
df

<br><br><span style="font-size:1.2em;">There can be several types of weapons used in one offense. For the sake of simplicity I will drop duplicates from the table.</span><br><br>

In [None]:
df_weapon=df_weapon.drop_duplicates(subset=['offense_id'],keep='last')
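
Dropping duplicates keeps one weapon per offense and discards the rest. For comparison only, here is a sketch of an alternative that this notebook does not use: collapsing all weapons of an offense into one joined string, so each offense_id still appears once. The data and weapon labels are toy values.

```python
import pandas as pd

toy = pd.DataFrame({
    "offense_id": [1, 1, 2, 3, 3],
    "weapon": ["Handgun", "Knife/Cutting Instrument", "Personal Weapons", "Rifle", "Handgun"],
})

# The notebook's approach: keep only the last weapon row per offense.
last_only = toy.drop_duplicates(subset=["offense_id"], keep="last")

# Alternative: one row per offense with every distinct weapon joined into a string.
combined = (toy.groupby("offense_id")["weapon"]
               .agg(lambda s: "; ".join(sorted(s.unique())))
               .reset_index())
print(last_only)
print(combined)
```

The joined form preserves information but multiplies the number of weapon categories, which is one reason the simpler drop may be preferable for plotting.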

### Checking for duplicates, missing values and other abnormalities, <u>bias table</u>

In [None]:
df_bias.info()

In [None]:
empty_string_count(df_bias)

In [None]:
# Checking for duplicates in offense_id column
df=df_bias[df_bias.duplicated(subset=['offense_id'],keep=False)].sort_values(by='offense_id')
df

<br><br><span style="font-size:1.2em;"> There can be several types of biases in one offense. The number of duplicates is low. For the sake of simplicity I will drop duplicates from the table.</span><br><br>



In [None]:
df_bias=df_bias.drop_duplicates(subset=['offense_id'],keep='last')

### Checking for duplicates, missing values and other abnormalities, <u>relationship table</u>

In [None]:
df_rel.info()

In [None]:
empty_string_count(df_rel)

In [None]:
df_rel['relationship_name'].value_counts()

In [None]:
# Replacing NULL values in relationship_name with 'Relationship Unknown'
df_rel.loc[df_rel.relationship_name.isnull(), 'relationship_name'] = 'Relationship Unknown'

In [None]:
# Checking for duplicate victim_id/offender_id pairs
df=df_rel[df_rel.duplicated(subset=['victim_id','offender_id'],keep=False)].sort_values(by='victim_id')
df

## Part III, combining the DataFrames

### DFs Info

In [None]:
df_incident.info()

In [None]:
with open('data/pickled_dataframes/incident_clean.pickle', 'wb') as f:
    pickle.dump(df_incident, f)

In [None]:
df_offense.info()

In [None]:
with open('data/pickled_dataframes/offense_clean.pickle', 'wb') as f:
    pickle.dump(df_offense, f)

In [None]:
df_offender.info()

In [None]:
with open('data/pickled_dataframes/offender_clean.pickle', 'wb') as f:
    pickle.dump(df_offender, f)

In [None]:
df_victim.info()

In [None]:
with open('data/pickled_dataframes/victim_clean.pickle', 'wb') as f:
    pickle.dump(df_victim, f)

In [None]:
df_weapon.info()

In [None]:
with open('data/pickled_dataframes/weapon_clean.pickle', 'wb') as f:
    pickle.dump(df_weapon, f)

In [None]:
df_weapon.weapon.value_counts()

In [None]:
df_bias.info()

In [None]:
with open('data/pickled_dataframes/bias_clean.pickle', 'wb') as f:
    pickle.dump(df_bias, f)

In [None]:
df_rel.info()

In [None]:
with open('data/pickled_dataframes/rel_clean.pickle', 'wb') as f:
    pickle.dump(df_rel, f)

<br><br><span style="font-size:1.2em;"><b>1. Offense, incident, bias and weapon DataFrames will be combined into one for the time-series analysis<br>
2. Offender, victim and relationship DataFrames will be set aside for the dashboard.</b></span><br><br>

### Combining Incident, Offense, Bias and Weapon DataFrames

In [None]:
df_full=df_offense.merge(df_incident, how='left', on='incident_id')
df_full.info()

In [None]:
df_full=df_full.merge(df_bias, how='left', on='offense_id')
df_full.info()

In [None]:
df_full=df_full.merge(df_weapon, how='left', on='offense_id')
df_full.info()
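
Each left merge above keeps one row per offense only if the join key is unique on the right-hand side. As a sketch of how that assumption could be checked, pandas' `merge` accepts a `validate` argument that raises `MergeError` when the declared relationship does not hold (toy data; the county names are illustrative):

```python
import pandas as pd

offenses = pd.DataFrame({"offense_id": [10, 11, 12], "incident_id": [1, 1, 2]})
incidents = pd.DataFrame({"incident_id": [1, 2], "primary_county": ["Denver", "Boulder"]})

# Many offenses may point at one incident, but each incident_id must be unique on the right.
merged = offenses.merge(incidents, how="left", on="incident_id", validate="many_to_one")

# A duplicated key on the right side makes the same call raise MergeError.
incidents_dup = pd.concat([incidents, incidents.iloc[[0]]], ignore_index=True)
try:
    offenses.merge(incidents_dup, how="left", on="incident_id", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(merged), duplicate_detected)
```

Without such a check, a duplicated key silently inflates the row count of the merged DataFrame.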

In [None]:
empty_string_count(df_full)

In [None]:
df_full.weapon.unique()

In [None]:
df=df_full[df_full.weapon.isnull()]
df.offense_category_name.unique()

In [None]:
# Replacing NaN values in weapon column by 'NA'. Offenses associated with weapon NaN values seem
# to be offenses with no weapon necessary

df_full.loc[df_full.weapon.isnull(), 'weapon'] = 'NA'

In [None]:
df_full.info()

In [None]:
with open('data/pickled_dataframes/df_full_clean.pickle', 'wb') as f:
    pickle.dump(df_full, f)

# EXPLORE

## EDA

### General information about the data

In [None]:
print('There are {} records of offenses in Colorado between 2009 and 2019'.format(len(df_full)))

In [None]:
df_full.nunique()

#### Plotting crime rate in different offense categories

In [None]:
freq='W'

df_x = df_full.groupby(['offense_category_name', pd.Grouper(key='timestamp',
                                                         freq=freq)])['offense_category_name'].agg(['count']).reset_index()
df_x = df_x.sort_values(by=['timestamp', 'count'])
df_x
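
To make the weekly binning concrete, here is a small sketch on made-up records. With `freq='W'`, pandas labels each bin by the Sunday that ends the week, so the first three toy dates fall into the week ending 2019-01-06 and the last into the week ending 2019-01-13.

```python
import pandas as pd

toy = pd.DataFrame({
    "offense_category_name": ["Arson", "Arson", "Robbery", "Arson"],
    "timestamp": pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-04", "2019-01-09"]),
})

# Count offenses per category and per week (weeks end on Sunday).
weekly = (toy.groupby(["offense_category_name", pd.Grouper(key="timestamp", freq="W")])
             .size()
             .reset_index(name="count"))

arson_first_week = weekly[(weekly["offense_category_name"] == "Arson")
                          & (weekly["timestamp"] == pd.Timestamp("2019-01-06"))]["count"].item()
print(weekly)
```

Changing `freq` to `'M'` or `'D'` in the same call would give monthly or daily counts.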

In [None]:
colors_dark24=px.colors.qualitative.Dark24
colors_dark24=colors_dark24[:-1]
crime_categories=['Assault Offenses', 'Larceny/Theft Offenses', 
 'Drug/Narcotic Offenses', 'Fraud Offenses',
 'Destruction/Damage/Vandalism of Property', 
 'Burglary/Breaking & Entering', 'Sex Offenses', 
 'Arson', 'Motor Vehicle Theft', 'Kidnapping/Abduction',
 'Weapon Law Violations', 'Robbery',
 'Pornography/Obscene Material', 'Counterfeiting/Forgery', 
 'Bribery', 'Stolen Property Offenses', 'Prostitution Offenses',
 'Homicide Offenses', 'Extortion/Blackmail',
 'Embezzlement', 'Gambling Offenses',
 'Human Trafficking', 'Animal Cruelty']

color_discrete_map_=dict(zip(crime_categories,colors_dark24))

In [None]:
fig1 = px.line(df_x, x='timestamp', y='count', color='offense_category_name', 
              color_discrete_map=color_discrete_map_, 
labels={ "timestamp": "Date",  "count": "Number of Offenses", "offense_category_name": "Offense Category"},
      title='Number of Offenses in Different Crime Categories',      
template="plotly_dark"
             )

fig1.update_layout(width=1000,
                  height=800)

fig1.update_layout(
    xaxis=dict(
#        rangeselector=dict(
#             buttons=list([
#                 dict(count=1,
#                      step="month",
#                      stepmode='backward'),
#             ])),
        rangeslider=dict(
            visible=True
        ),
    )
)
fig1.show()

In [None]:
with open('images/pickled_figs/crime_cat.pickle', 'wb') as f:
    pickle.dump(fig1, f)

#### Number of Offenses in Weapon Categories

In [None]:
df_weapon = df_full.groupby(['weapon']).count().sort_values(['offense_id'], ascending=False).reset_index()
df_weapon = df_weapon[df_weapon ['weapon'] != 'NA']


fig1 = px.bar(df_weapon, x='weapon',  y='offense_id', color='weapon',
            labels={"weapon": "Weapon",  "offense_id": "Number of Offenses"},
            title='Weapons Used in Offenses',
template="plotly_dark"
             )

fig1.update_layout(width=1000,
                  height=700,
                  bargap=0.05)
fig1.show()

In [None]:
with open('images/pickled_figs/weapons.pickle', 'wb') as f:
    pickle.dump(fig1, f)

#### Crime rate per zip codes

In [None]:
df_zip = df_full.groupby(['icpsr_zip']).count().sort_values(['offense_id'], ascending=False).reset_index()


fig1 = px.bar(df_zip[:15], x='icpsr_zip',  y='offense_id', color='icpsr_zip',
            labels={"icpsr_zip": "Zip Codes",  "offense_id": "Number of Offenses"},
             title='Zip Codes with the Highest Offense Numbers',
template="plotly_dark"
             )

fig1.update_layout(width=1000,
                  height=700,
                  bargap=0.05)
fig1.show()

In [None]:
with open('images/pickled_figs/zips.pickle', 'wb') as f:
    pickle.dump(fig1, f)

#### Crime rate per county

In [None]:
df_county = df_full.groupby(['primary_county']).count().sort_values(['offense_id'], ascending=False).reset_index()


fig1 = px.bar(df_county[:15], y='primary_county',  x='offense_id', color='primary_county',  orientation='h',
            labels={"primary_county": "County",  "offense_id": "Number of Offenses"},
             title='Counties with the Highest Offense Numbers',
template="plotly_dark"
             )

fig1.update_layout(width=1000,
                  height=700,
                  bargap=0.05)
fig1.show()

In [None]:
with open('images/pickled_figs/counties.pickle', 'wb') as f:
    pickle.dump(fig1, f)

#### Crime rate over day hours

In [None]:
df_hour = df_full.groupby(['incident_hour']).count().sort_values(['offense_id'], ascending=False).reset_index()
df_hour = df_hour[df_hour ['incident_hour'] != 25]

fig1 = px.bar(df_hour, x='incident_hour',  y='offense_id',
            labels={"incident_hour": "Hour (24hr format)",  "offense_id": "Number of Offenses"},
              title='Most Dangerous Hours',
template="plotly_dark"
             )

fig1.update_layout(width=1000,
                  height=700,
                  bargap=0.05)
fig1.show()

In [None]:
with open('images/pickled_figs/hours.pickle', 'wb') as f:
    pickle.dump(fig1, f)

#### Geography of crime

In [None]:
# fig2=map_choropleth_location(df_zip, 'icpsr_zip', 'Zip code', 'offense_id', 'Number of Offenses',
#                             CO_zip_json, 'properties.ZCTA5CE10', 'Number of Offenses per Zip Code')

In [None]:
# with open('images/pickled_figs/zip_map.pickle', 'wb') as f:
#     pickle.dump(fig2, f)

In [None]:
with open('images/pickled_figs/zip_map.pickle', 'rb') as f:
    fig2=pickle.load(f)
fig2.show()

In [None]:
# fig2=map_choropleth_location(df_county, 'primary_county', 'County', 'offense_id', 'Number of Offenses',
#                             CO_county_json, 'properties.name', 'Number of Offenses per County')

In [None]:
# with open('images/pickled_figs/county_map.pickle', 'wb') as f:
#     pickle.dump(fig2, f)

In [None]:
with open('images/pickled_figs/county_map.pickle', 'rb') as f:
    fig2=pickle.load(f)
fig2.show()

**It takes ~2 minutes to run this notebook**

<br><span style="font-size:1.2em;">General crime rate modeling is in the [part III notebook](capstone_project_part3.ipynb). It is kept separate so that each notebook stays manageable.</span><br><br>