# Denver Crime Starter

This notebook is intended to act as a starter for participants in the 4th Paradigm Denver Crime Data Science Project. If you haven't done so already, please consult the [*Project setup*](https://github.com/the4thparadigm/hitchhikers_guide/tree/master/ds_projects/project_set_up) section of the Hitchhiker's Guide and the [*Getting started*](https://github.com/dawsoneliasen/denvercrime#getting-started) section of the project README on GitHub. 

## Packages
A few standard packages appear in almost every data science project. They are not part of Python's standard library, so they must be installed separately and imported before use. This exercise uses:
* numpy: provides arrays and linear algebra routines (a dependency of both pandas and matplotlib)
* pandas: provides convenient tables for organizing data (DataFrames) and file I/O
* matplotlib: provides visualization functionality

Run the code cell below to import these packages.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd

## Importing data

The next step is to load the data into the notebook. If you haven't already done so, download the Denver crime data and unzip it into `data/raw/` (see the [*Getting started*](https://github.com/dawsoneliasen/denvercrime#getting-started) section of the project README). Then make the CSV accessible by running `chmod 700 data/raw/crime.csv`. Now let's read the .csv file into a pandas DataFrame.

In [2]:
df = pd.read_csv('../data/raw/crime.csv', engine='python')
df.dataframeName = 'crime.csv'
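
If the read fails with a file-not-found error, the relative path is a likely culprit, because it is resolved against the notebook's working directory. The following is a minimal sketch of a guard around the load step. The helper name `load_crime_csv` is hypothetical and not part of the project.

```python
import os

import pandas as pd


def load_crime_csv(path):
    """Hypothetical helper: read a CSV into a DataFrame, failing early if the file is missing."""
    if not os.path.isfile(path):
        # report the working directory so a wrong relative path is easy to spot
        raise FileNotFoundError(
            f"{path!r} not found (working directory: {os.getcwd()!r})"
        )
    return pd.read_csv(path)
```

In this notebook it could be called as `df = load_crime_csv('../data/raw/crime.csv')`.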

Now that the data are imported, let's print out a small portion of the dataframe to see what it looks like. 

In [3]:
df.head(5)

Unnamed: 0,INCIDENT_ID,OFFENSE_ID,OFFENSE_CODE,OFFENSE_CODE_EXTENSION,OFFENSE_TYPE_ID,OFFENSE_CATEGORY_ID,FIRST_OCCURRENCE_DATE,LAST_OCCURRENCE_DATE,REPORTED_DATE,INCIDENT_ADDRESS,GEO_X,GEO_Y,GEO_LON,GEO_LAT,DISTRICT_ID,PRECINCT_ID,NEIGHBORHOOD_ID,IS_CRIME,IS_TRAFFIC
0,2016376978,2016376978521300,5213,0,weapon-unlawful-discharge-of,all-other-crimes,6/15/2016 11:31:00 PM,,6/15/2016 11:31:00 PM,,3193983.0,1707251.0,-104.809881,39.773188,5,521,montbello,1,0
1,20186000994,20186000994239900,2399,0,theft-other,larceny,10/11/2017 12:30:00 PM,10/11/2017 4:55:00 PM,1/29/2018 5:53:00 PM,,3201943.0,1711852.0,-104.781434,39.785649,5,522,gateway-green-valley-ranch,1,0
2,20166003953,20166003953230500,2305,0,theft-items-from-vehicle,theft-from-motor-vehicle,3/4/2016 8:00:00 PM,4/25/2016 8:00:00 AM,4/26/2016 9:02:00 PM,2932 S JOSEPHINE ST,3152762.0,1667011.0,-104.957381,39.66349,3,314,wellshire,1,0
3,201872333,201872333239900,2399,0,theft-other,larceny,1/30/2018 7:20:00 PM,,1/30/2018 10:29:00 PM,705 S COLORADO BLVD,3157162.0,1681320.0,-104.94144,39.702698,3,312,belcaro,1,0
4,2017411405,2017411405230300,2303,0,theft-shoplift,larceny,6/22/2017 8:53:00 PM,,6/23/2017 4:09:00 PM,2810 E 1ST AVE,3153211.0,1686545.0,-104.95537,39.717107,3,311,cherry-creek,1,0
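
Beyond `head`, a few other calls give a quick picture of a DataFrame: its size, the type of each column, and how many values are missing. The sketch below is illustrative only. It runs on a three-row sample built from a handful of the columns shown above, and the name `sample` is an assumption. The same calls should work on the full `df`.

```python
import io

import pandas as pd

# a small sample taken from the first three rows printed above (subset of columns)
sample_csv = io.StringIO(
    "INCIDENT_ID,OFFENSE_CATEGORY_ID,LAST_OCCURRENCE_DATE,INCIDENT_ADDRESS,DISTRICT_ID\n"
    "2016376978,all-other-crimes,,,5\n"
    "20186000994,larceny,10/11/2017 4:55:00 PM,,5\n"
    "20166003953,theft-from-motor-vehicle,4/25/2016 8:00:00 AM,2932 S JOSEPHINE ST,3\n"
)
sample = pd.read_csv(sample_csv)

# (number of rows, number of columns)
print(sample.shape)

# the data type pandas inferred for each column
print(sample.dtypes)

# number of missing values in each column
missing = sample.isna().sum()
print(missing)
```

Empty fields in the CSV, such as the blank `LAST_OCCURRENCE_DATE` in the first row, are read as missing values and show up in the `isna()` counts.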


## Making a visualization
To start building an intuition for the data, let's visualize how often each offense category occurs (using the `OFFENSE_CATEGORY_ID` column).

In [None]:
# first, keep only the columns that have a small number of distinct values
# (more than 1 and fewer than 50), i.e. the columns that behave like categories
nunique = df.nunique()
df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]]

nGraphShown = 2
nGraphPerRow = 5
nRow, nCol = df.shape
columnNames = list(df)
# integer division: plt.subplot requires whole numbers of rows and columns
nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow

# initialize the pyplot figure
plt.figure(num=None, figsize=(6 * nGraphPerRow, 8 * nGraphRow), dpi=80, facecolor='w', edgecolor='k')

plt.subplot(nGraphRow, nGraphPerRow, 1 + 1)

# pull out the offense category column as a pandas Series, accessing it by name
column = df['OFFENSE_CATEGORY_ID']

# we can get the frequency of each category by calling value_counts() on the column
counts = column.value_counts()

# make a bar plot
counts.plot.bar()

# add labels
plt.title('Frequency of offense categories')
plt.xlabel('category')
plt.ylabel('counts')

# format the ticks on the x axis
plt.xticks(rotation=90)

# set the layout
plt.tight_layout(pad=1.0, w_pad=1.0, h_pad=1.0)

# display the plot
plt.show()
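
The core of the cell above is two steps: count how often each value occurs, then draw the counts as bars. As a minimal sketch, those steps can be wrapped in a small function that takes the column name as an argument. The helper name `plot_category_counts` and the five-row `sample` frame (built from the categories in the rows printed earlier) are assumptions for illustration, not part of the project.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_category_counts(frame, column_name):
    """Hypothetical helper: bar plot of how often each value occurs in one column."""
    # value_counts() returns the counts sorted from most to least frequent
    counts = frame[column_name].value_counts()

    fig, ax = plt.subplots(figsize=(8, 6))
    counts.plot.bar(ax=ax)

    # add labels and rotate the category names so they stay readable
    ax.set_title(f"Frequency of {column_name}")
    ax.set_xlabel("category")
    ax.set_ylabel("counts")
    ax.tick_params(axis="x", labelrotation=90)
    fig.tight_layout()
    return counts, ax


# offense categories of the five rows shown by df.head(5)
sample = pd.DataFrame(
    {
        "OFFENSE_CATEGORY_ID": [
            "all-other-crimes",
            "larceny",
            "theft-from-motor-vehicle",
            "larceny",
            "larceny",
        ]
    }
)
counts, ax = plot_category_counts(sample, "OFFENSE_CATEGORY_ID")
```

Passing the column name explicitly avoids relying on column position, so the same function could be reused for other low-cardinality columns such as `DISTRICT_ID`. In a notebook, follow the call with `plt.show()` if the figure does not render on its own.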