# Denver Crime Starter

This notebook is a starting point for participants in the 4th Paradigm Denver Crime Data Science Project. If you haven't done so already, please consult the [*Project setup*](https://github.com/the4thparadigm/hitchhikers_guide/tree/master/ds_projects/project_set_up) section of the Hitchhiker's Guide and the [*Getting started*](https://github.com/dawsoneliasen/denvercrime#getting-started) section of the project README on GitHub.

## Packages
A handful of standard packages are used in almost every data science project. Apart from `os`, which ships with Python's standard library, these packages must be installed separately, and all of them must be imported before you can use them. The packages we are using for this exercise are:
* numpy: provides n-dimensional arrays and numerical routines such as linear algebra (required by pandas and matplotlib)
* matplotlib: provides visualization functionality
* pandas: provides convenient structures for organizing data (Series, DataFrame) and file I/O
* seaborn: a beautification layer on top of matplotlib

Run the code cell below to import these packages.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns

## Load Configuration Files

Let's load the configuration file, which lets us locate the input data and write output without hard-coding paths.

In [None]:
from denvercrime.src.utils.config_loader import get_config
conf = get_config()

## Importing data

Next, import the data into the notebook. If you haven't already done so, download the Denver crime data and unzip it into `data/raw/` (see the [*Getting started*](https://github.com/dawsoneliasen/denvercrime#getting-started) section of the project README). Then give yourself access to the CSV by running `chmod 700 data/raw/crime.csv`. Let's read the data from the .csv file into a pandas DataFrame.

In [None]:
#df = pd.read_csv('../data/raw/crime.csv', engine='python')
df = pd.read_csv(os.path.join(conf.dirs.data.raw, "crime.csv"))
df.dataframeName = 'crime.csv'
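
If `pd.read_csv` fails, the cause is often a missing file or a permissions problem rather than the data itself. The sketch below shows one way to check both before loading. The helper name `is_readable_file` is hypothetical, and a throwaway temporary file stands in for `data/raw/crime.csv`.

```python
import os
import tempfile


def is_readable_file(path):
    """Return True if path is an existing regular file the current user can read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)


# A temporary file stands in for data/raw/crime.csv (assumption for illustration)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "crime.csv")
    with open(demo_path, "w") as handle:
        handle.write("OFFENSE_CATEGORY_ID\nlarceny\n")

    found = is_readable_file(demo_path)
    missing = is_readable_file(os.path.join(tmp_dir, "missing.csv"))

print(found, missing)
```

In the notebook itself, the same check could be applied to `os.path.join(conf.dirs.data.raw, "crime.csv")` before calling `pd.read_csv`.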

Now that the data are imported, let's print out a small portion of the dataframe to see what it looks like. 

In [None]:
df.head(5)
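
Beyond `head`, a few other calls give a quick sense of a DataFrame's size, column types, and missing values. The sketch below uses a tiny made-up DataFrame so it runs on its own; the rows and the `EXAMPLE_FLAG` column are invented for illustration and are not taken from the crime data.

```python
import pandas as pd

# Made-up rows for illustration only; not real records from crime.csv
toy = pd.DataFrame({
    "OFFENSE_CATEGORY_ID": ["larceny", "burglary", None, "larceny"],
    "EXAMPLE_FLAG": [1, 1, 0, 1],  # hypothetical column
})

# number of rows and columns
n_rows, n_cols = toy.shape

# count of missing values in each column
missing_per_column = toy.isna().sum()

# data type pandas inferred for each column
print(toy.dtypes)
print(n_rows, n_cols)
print(missing_per_column)
```

The same three calls (`df.shape`, `df.dtypes`, `df.isna().sum()`) should work unchanged on the full dataframe.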

## Making a visualization
To start building an intuition for what we're dealing with, let's investigate the frequency of various types of offenses. The `OFFENSE_CATEGORY_ID` column provides the category of each offense.

In [None]:
# we're only interested in one column - let's pull it out as a pandas Series
column = df.loc[:, 'OFFENSE_CATEGORY_ID']

# once again, we can use head to take a look
column.head(5)

In [None]:
# we can get the count of each category by calling value_counts() on the column
counts = column.value_counts()
# counts is a Series object (value_counts() returns a Series object); 
# in this case, a series of counts associated with values for offense_category_id
counts.head(5)
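
Raw counts can be hard to compare at a glance, so it may help to look at each category's share of the total as well. As a minimal sketch, `value_counts(normalize=True)` returns proportions instead of counts. The values below are made up for illustration.

```python
import pandas as pd

# Made-up values for illustration only
toy_column = pd.Series(
    ["larceny", "larceny", "burglary", "arson"],
    name="OFFENSE_CATEGORY_ID",
)

# absolute counts, sorted from most to least frequent
toy_counts = toy_column.value_counts()

# the same information expressed as a fraction of all rows
toy_shares = toy_column.value_counts(normalize=True)

print(toy_counts)
print(toy_shares)
```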

In [None]:
# set the seaborn style before creating the figure;
# styles only apply to figures and axes created afterwards
sns.set_style('darkgrid')
sns.set_palette('deep')

# initialize the pyplot figure ("canvas")
plt.figure(num=None, figsize=(6, 6), dpi=100, facecolor='w', edgecolor='k')

# because the data are now held in a pandas Series object,
# we can generate the plot simply by calling a method on the object
# this (pandas) function creates a matplotlib plot
counts.plot.bar()

# add labels
plt.title('Frequency of offense categories')
plt.ylabel('counts', size=12)

# format the ticks on the x axis
plt.xticks(rotation=45, ha='right')

# set the layout
plt.tight_layout(pad=1.0, w_pad=1.0, h_pad=1.0)

# display the plot
fig = plt.gcf()
plt.show()

## Save Visualizations to File

In [None]:
output_file = "offenses.png"
fig.savefig(os.path.join(conf.dirs.output, output_file), dpi=600)
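
`savefig` does not create missing directories, so the save fails if the output folder does not exist yet. A minimal sketch of guarding against that with `os.makedirs` follows. A temporary directory stands in for `conf.dirs.output`, and the `output/figures` layout is an assumption made for illustration.

```python
import os
import tempfile

# A temporary directory stands in for conf.dirs.output (assumption)
base_dir = tempfile.mkdtemp()
output_dir = os.path.join(base_dir, "output", "figures")

# create the directory and any missing parents
os.makedirs(output_dir, exist_ok=True)

# with exist_ok=True, calling it again on an existing directory is harmless
os.makedirs(output_dir, exist_ok=True)

target_path = os.path.join(output_dir, "offenses.png")
print(target_path)
```

In the notebook, calling `os.makedirs(conf.dirs.output, exist_ok=True)` before `fig.savefig(...)` would have the same effect.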

## Process
No one expects you to have this code memorized! **Documentation is a programmer's best friend**. Read the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html), the [matplotlib documentation](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.bar.html), and the [seaborn documentation](https://seaborn.pydata.org/generated/seaborn.countplot.html#seaborn.countplot), and search Stack Overflow when you get stuck.

## Exploring further
* What does this visualization tell us?
* In what ways does it mislead us? 
* How could we expand or refine this visualization?
* What other questions can we ask of the data?
* What's another visualization we could make right now?
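
As one possible answer to the last question, the sketch below counts offenses per calendar month. It assumes the data has a date column, called `FIRST_OCCURRENCE_DATE` here; check `df.columns` for the actual name and format. The dates are made up so the example runs on its own.

```python
import pandas as pd

# Made-up dates for illustration; the column name is an assumption
toy = pd.DataFrame({
    "FIRST_OCCURRENCE_DATE": [
        "2019-01-05", "2019-01-20", "2019-02-11", "2019-03-02",
    ],
})

# parse the strings into datetime values
toy["occurred_at"] = pd.to_datetime(toy["FIRST_OCCURRENCE_DATE"])

# count rows per month number, then order the result by month
monthly_counts = toy["occurred_at"].dt.month.value_counts().sort_index()

print(monthly_counts)
```

Because `monthly_counts` is a Series, calling `monthly_counts.plot.bar()` would chart it in the same way as the offense categories.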